Downtime after a caBundle update until Secret propagation to pod #50
Comments
That's right, which is exactly why RestartOnSecretRefresh exists: it's probably faster to just kill the pod and let it restart.
There's probably a better way, but I'm not sure what it is, and this worked well enough :)
What's the invariant that would make a restarted pod pick up the new Secret? If anything, the same container will just be restarted by the container runtime on the same host (per its restartPolicy), and it will still see the same volume, no? In that case it will miss the change event and keep using the same Secret. I think it would be a lot better to do something like cross-signing, or to expand the CA list by keeping both the old and new (prev/next) CA certs around.
I'm not sure. In my experience restarting the pod has always worked instantly. I feel like someone told me that once, but it was some time ago... Yup, serving up multiple certs would definitely be a cleaner way of doing this. Given that we set a default 10-year expiry period (IIRC), we were most concerned with startup performance, where the original Secret is effectively empty, so there's zero chance of it working during the initial startup. Again, it wouldn't surprise me if cert-manager solved this in some much better way.
In our setup we'd much rather not use cert-manager (it comes with multiple components/CRDs). I think developing a patch around keeping both CAs in the
sgtm in all cases except for the initial startup. @maxsmythe, @ritazh, wdyt?
+1 to having multiple bundles. Might be worth figuring out a way to gradually roll out the cert across processes too.
+1 on supporting multiple bundles.
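To make the "keep both prev/next CAs in the bundle" idea concrete, here is a minimal, self-contained Go sketch. It is not cert-controller's actual code: `issue`, `pemOf`, `mergeCABundles`, `trusts`, and `rotationDemo` are hypothetical helper names invented for this illustration, and the certificates are throwaway ones generated in memory. It only shows that a serving cert signed by the old CA keeps verifying while the bundle contains both CAs, whereas a bundle with only the new CA rejects it.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// issue (hypothetical helper) creates a certificate for cn. With a nil parent it
// is a self-signed CA; otherwise it is a serving cert signed by parent/parentKey.
func issue(cn string, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	if parent == nil { // self-signed CA
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage, tmpl.ExtKeyUsage = x509.KeyUsageCertSign, nil
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// pemOf encodes a certificate as a single PEM block.
func pemOf(c *x509.Certificate) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: c.Raw})
}

// mergeCABundles returns one PEM bundle containing every CERTIFICATE block
// from next and prev, with exact duplicates dropped.
func mergeCABundles(prev, next []byte) []byte {
	var out bytes.Buffer
	seen := map[string]bool{}
	for _, rest := range [][]byte{next, prev} {
		for {
			var block *pem.Block
			block, rest = pem.Decode(rest)
			if block == nil {
				break
			}
			if block.Type != "CERTIFICATE" || seen[string(block.Bytes)] {
				continue
			}
			seen[string(block.Bytes)] = true
			_ = pem.Encode(&out, block)
		}
	}
	return out.Bytes()
}

// trusts reports whether leaf chains to any CA in the PEM bundle.
func trusts(bundle []byte, leaf *x509.Certificate) bool {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(bundle) {
		return false
	}
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

// rotationDemo simulates a CA rotation and reports which bundles trust which leaf.
func rotationDemo() (newOnlyTrustsOld, mergedTrustsOld, mergedTrustsNew bool, err error) {
	oldCA, oldKey, err := issue("old-ca", nil, nil)
	if err != nil {
		return
	}
	newCA, newKey, err := issue("new-ca", nil, nil)
	if err != nil {
		return
	}
	oldLeaf, _, err := issue("webhook.example.svc", oldCA, oldKey)
	if err != nil {
		return
	}
	newLeaf, _, err := issue("webhook.example.svc", newCA, newKey)
	if err != nil {
		return
	}
	merged := mergeCABundles(pemOf(oldCA), pemOf(newCA))
	return trusts(pemOf(newCA), oldLeaf), trusts(merged, oldLeaf), trusts(merged, newLeaf), nil
}

func main() {
	fmt.Println(rotationDemo())
}
```

In a real rotation, the merged bundle would presumably be what gets written to the webhook configuration's caBundle, with the old CA dropped only once every pod is serving a cert from the new one. That ordering is an assumption about how such a patch could work, not something the library does today.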
Hi folks, there is a similar issue: ratify-project/ratify#821. mTLS is required between Gatekeeper and the external data provider. By default, cert-controller is used to generate and rotate Gatekeeper's webhook certificate. In our case, the user rotated the certificate manually. Kubernetes seemed to take about 60-90 seconds to propagate the change to the Secret. During this delay, requests sent to the external data provider fail.
@acpana This could be interesting work.
Thanks for the tag, max! I can have a look at this in my downtime from other projects. I will assign it to myself when I get to it. In the meantime, folks can feel free to jump on it if they have cycles.
I also raised this problem a while ago. See #13.
I created a diagram to help me better understand the problem:
sequenceDiagram
autonumber
participant client
par Update certs
cert-controller->>apiserver: Create/update Secret
cert-controller->>apiserver: Update webhook configuration
Note over kubelet: Delay before updating volume,<br>usually 30 to 100 seconds.
kubelet->>apiserver: Read updated Secret
kubelet->>volume: Write certs to /tmp/k8s-webhook-server/serving-certs
webhook-server->>volume: Read certs from /tmp/k8s-webhook-server/serving-certs
and Call webhook
client->>apiserver: Create/update resource
apiserver->>webhook-server: Submit create/update request
Note over apiserver: Clients certs are from the webhook configuration
webhook-server->>apiserver: TLS Error
Note over webhook-server: Server certs are from the volume<br>They do not (yet) match the client certs.
end
Based on my experimentation, the kubelet's latency in reflecting updates to a watched Secret (configMapAndSecretChangeDetectionStrategy=Watch) on a container's filesystem ranges from 30 to 100 seconds (i.e. it is not instant), regardless of whether the cluster is minikube, kind, GKE, or kubeadm.
Does this basically mean that, until the container that's running the webhook (and automating certificate management with the cert-controller package) sees the updated Secret, the webhook will actually be down? This library updates the WebhookConfiguration's .caBundle field with the new CA cert (which takes effect instantly), so it will no longer match the served TLS certificate for another minute or so.
Is this a known issue, or something that's factored into the current design and already solved (maybe I'm seeing it incorrectly)?