Liveness and Readiness Probes Consistently Failing #824

Open
throwanexception opened this issue Jun 7, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments

@throwanexception

Description
We're testing out the policy-controller, and the readiness and liveness probes for the cosign-policy-controller-webhook begin to fail after an extended period (~18-24 hours). Up until then, the deployment appears to work correctly.

```
44m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Readiness probe failed: Get "https://x:8443/readyz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
7s          Warning   BackOff            pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Back-off restarting failed container
14m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Liveness probe failed: Get "https://x:8443/healthz": read tcp x:56308->x:8443: read: connection reset by peer
40m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-m8phs   Liveness probe failed: Get "https://x:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

After this, the Deployment will continually crash every few minutes.

We've also noticed that we get errors about the image digest:

```
admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image
```

Upon retry, it will (usually) resolve the image to a digest correctly.

Our setup uses IRSA to attach the WebIdentityToken to the pod. This is natively supported by go-containerregistry, so it seems to work correctly here, but we're unsure whether it might be related. The images we're pulling are from ECR, so the IRSA WebIdentityToken provides the permissions to access them.
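
For reference, the IRSA wiring is just the standard service account annotation; a minimal sketch (the account name, namespace, and role ARN here are placeholders, not our real values):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: policy-controller-webhook
  namespace: cosign-system
  annotations:
    # IRSA: EKS injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into
    # pods using this service account; the AWS credential chain used for
    # ECR auth picks these up automatically.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ecr-read-only
```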

The image policy we're using is a single ECDSA P-256 public key to verify our images, so it seems unlikely to be related.
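
For context, the policy is essentially a single-key ClusterImagePolicy along these lines (the glob and key material are redacted/illustrative):

```yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: signed-ecr-images
spec:
  images:
    # Match images from our (placeholder) ECR registry.
    - glob: "123456789012.dkr.ecr.*.amazonaws.com/**"
  authorities:
    - key:
        # ECDSA P-256 public key used to verify image signatures.
        data: |
          -----BEGIN PUBLIC KEY-----
          ...
          -----END PUBLIC KEY-----
```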

Our clusters are quite active, especially with the constant synthetic health checking we run, so images are being pulled frequently for end-to-end testing. I enabled Knative debug logging by changing the ConfigMaps for the services, but the debug output has not been helpful so far.
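
For the record, debug logging was enabled via the Knative-style config-logging ConfigMap, roughly like this (exact ConfigMap name and keys may differ by chart version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: cosign-system
data:
  # zap logger config consumed by the knative.dev/pkg logging setup.
  zap-logger-config: |
    {
      "level": "debug",
      "encoding": "json",
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoderConfig": {
        "messageKey": "msg",
        "levelKey": "level",
        "levelEncoder": "lowercase"
      }
    }
```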

Any guidance or help would be appreciated!

Version
v0.7.0 of the policy-controller

@throwanexception added the bug label on Jun 7, 2023
@throwanexception changed the title from "Liveness and Readiness Probes Consistently Failing After Extended Time" to "Liveness and Readiness Probes Consistently Failing" on Jun 7, 2023
@hectorj2f
Collaborator

@throwanexception I'd recommend using our latest version; we've simplified the deployment to use a single webhook. Please verify whether you're still experiencing the crash there.

@throwanexception
Author

> @throwanexception I'd recommend using our latest version; we've simplified the deployment to use a single webhook. Please verify whether you're still experiencing the crash there.

After about a week of constant usage with the v0.8.0 release on our clusters, we're seeing the same issue I reported for v0.7.0. The policy-controller begins to time out its readiness/liveness probes and is restarted by the kubelet. We also see the same errors around the image digest when this occurs. From what I can observe, the memory usage is growing unbounded (possibly a leak?):

```
$ kubectl top pod -n cosign-system
NAME                                 CPU(cores)   MEMORY(bytes)
policy-controller-555465fd55-g67kc   1332m        1730Mi
policy-controller-555465fd55-mmtw4   1166m        1282Mi
policy-controller-555465fd55-qpcrt   948m         1511Mi
```

@hectorj2f
Collaborator

```
admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image
```

This error is expected whenever the image reference cannot be resolved to a digest.

Regarding the growing memory usage, I'd inspect the logs to identify what is going on in the controller. We're using the policy-controller in our cluster and we haven't experienced this growing memory behaviour.
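
If the logs don't reveal anything, a heap profile would help. The controller builds on knative.dev/pkg, which can expose a pprof server (port 8008 by default) via the observability ConfigMap; a sketch, assuming the chart ships such a ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: cosign-system
data:
  # Enables the knative.dev/pkg profiling (pprof) handler on port 8008.
  profiling.enable: "true"
```

With that enabled, you can `kubectl port-forward` to port 8008 and capture `/debug/pprof/heap` snapshots a few hours apart; diffing them should show where the memory is being retained.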

> The policy-controller begins to time out its readiness/liveness probes and is restarted by the kubelet

That is odd. I'd try adjusting the liveness/readiness probe settings to see whether the failures are related to the growing memory or CPU consumption.
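
For example, something like this on the webhook Deployment (the paths and port come from your events above; the timing values are illustrative starting points, not recommendations):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS
  timeoutSeconds: 5      # default is 1s, which a busy pod can easily miss
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /readyz
    port: 8443
    scheme: HTTPS
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6
```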

Did you see this memory growth with v0.7.0 too?

@austinorth

FWIW, I'm also seeing a memory leak on both v0.9.0 and v0.8.4.

@austinorth

Yesterday I discovered someone had set the --policy-resync-period flag to 1m. My working theory is that the in-memory cache can't handle that frequency, as the default is every 10h. 🤔 I'm testing a revert to the default today to see if that makes a difference.
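
Concretely, the revert is just changing the controller args back, something like (container name here is illustrative):

```yaml
containers:
  - name: policy-controller
    args:
      # Was set to 1m; 10h is the default resync period, so policies
      # are no longer re-resolved every minute.
      - --policy-resync-period=10h
```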

@TomWKraken

We have also seen this issue on our busy clusters, with memory usage on the policy-controller pods slowly creeping up over the course of a week. We specify images by both digest and tag in our deployments. When a pod reaches its memory limit, we see the error:

```
admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image
```

We updated to chart version 0.7.2 / app version 0.9.0 and still see the issue.

@hectorj2f is there any work we can do to try and locate the source of this issue?

[Screenshot attached: "Screenshot 2024-11-29 at 15 47 07"]
