-
Notifications
You must be signed in to change notification settings - Fork 66
🐛 OPRUN-4110: Restart when SystemCertPool should change #2175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2175 +/- ##
==========================================
+ Coverage 72.71% 72.78% +0.06%
==========================================
Files 79 79
Lines 7378 7447 +69
==========================================
+ Hits 5365 5420 +55
- Misses 1666 1679 +13
- Partials 347 348 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: camilamacedo86 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
New changes are detected. LGTM label has been removed. |
I think I need to add certpoolwatch support for the |
d88192c
to
b645ba4
Compare
func (cpw *CertPoolWatcher) Done() { | ||
cpw.done <- true | ||
// Change the restart behavior | ||
// The default is to os.Exit(0) when a change to SSL_CERT_DIR or SSL_CERT_FILE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer if the restart func would be a pointer, and default to nil. That way, it would be much more explicit when servicing its use-case, since the change of behavior to do os.Exit(0) would be associated with that area of the code.
At this point I'm aware of one use-case serviced by this functionality, so it wouldn't lead to code duplication. This is evident in the cpw.checkForRestart function, where we only check for conditions associated with the intially-unavailable-so-permanently-un-discoverable CA cert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe a func
is considered a pointer, and defaults to nil
, so need to change the type, just need to change where it is assigned.
But I did put the default to effectively be os.Exit(0)
, so that I don't have to check the pointer, but it's only one place.
So, you are suggesting the assignment occur where the CertPoolWatcher
is created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, then the concern & action are co-located, which makes it easier to audit.
Confirmed with @tmshort that this comment is no longer valid. |
Yeah, that comment was a comment to myself late at night to check something the next day, as I was finishing for the night. I added an EDIT. |
This fixes a downstream bug There was a problem downstream where the OpenShift servivce-ca was not yet available, and due to the way the manifests were set up, the service-ca was considered to be part of the SystemCertPool. The problem is that the SystemCertPool, once initialized, will never reload itself. We can get into this situation when we use SSL_CERT_DIR and SSL_CERT_FILE to provide OpenShift CAs to be used by containers/image for pulling. These environment variables change the source of the SystemCertPool. The CertPoolWatcher then watches these locations, and tries to update the pool it provides to the HTTPS client connecting to catalogd. But the SystemCertPool is never updated. (It did not help that there was no explicit CertPoolWatcher for the pull CAs.) I tried to fix this downstream by removing SSL_CERT_DIR, and specifying the `--pull-cas-dir` option. This means that containers/image would directly use certificates that we specify, rather than the default location. But this breaks the use of custom CAs for local image registries. The containers/image package does not provide a way to manipulate the certificate locations beyond a simple directory setting, and we need to leave that directory setting as the default in downstream because it (i.e. /etc/docker/certs.d) is a host- mounted directory that contains certificates for local image registries. And it is possible to configure a custom CA for a local image registry, so that directory must be included, ALONG with the OpenShift provided CAs and service-ca, which is defined by SSL_CERT_DIR. But because of the use of SSL_CERT_DIR to include the OpenShift service-ca, if the service-ca was not available at startup, but became available later, it was not possible to reload the SystemCertPool. Which could cause problems in operator-controller when it tried to connect to catalogd. The fundamental problem is that there's no way to refresh the SystemCertPool, which will become more and more of an issue as certificate lifetimes decrease. Using SSL_CERT_DIR allows us to use the CertPoolWatcher to notice changes to the SystemCertPool. This will allow us to restart the process when certificates change (e.g. OpenShift service-ca becomes available). Changes: * Update CertPoolWatcher to restart on changes to SSL_CERT_DIR and SSL_CERT_FILE * Update CertPoolWatcher to use a Runnable interface, so that it can be added to the manager, and started later, which may improve the changes that the service-ca is ready. * Update CertPoolWatcher to not be created when there's nothing to watch. * Add CertPoolWatcher to catalogd for pull CAs * Add CertPoolWatcher to operator-controller for pull CAs * Improve logging With this, my downstream manifest change should be reverted. Signed-off-by: Todd Short <[email protected]> Assisted-by: Claude Code (alternatives)
This fixes a downstream bug
There was a problem downstream where the OpenShift servivce-ca was not yet available, and due to the way the manifests were set up, the service-ca was considered to be part of the SystemCertPool. The problem is that the SystemCertPool, once initialized, will never reload itself.
We can get into this situation when we use SSL_CERT_DIR and SSL_CERT_FILE to provide OpenShift CAs to be used by containers/image for pulling. These environment variables change the source of the SystemCertPool. The CertPoolWatcher then watches these locations, and tries to update the pool it provides to the HTTPS client connecting to
catalogd. But the SystemCertPool is never updated. (It did not help that there was no explicit CertPoolWatcher for the pull CAs.)
I tried to fix this downstream by removing SSL_CERT_DIR, and specifying the
--pull-cas-dir
option. This means that containers/image would directly use certificates that we specify, rather than the default location.But this breaks the use of custom CAs for local image registries.
The containers/image package does not provide a way to manipulate the certificate locations beyond a simple directory setting, and we need to leave that directory setting as the default in downstream because it (i.e. /etc/docker/certs.d) is a host-mounted directory that contains certificates for local image registries. And it is possible to configure a custom CA for a local image registry, so that directory must be included, ALONG with the OpenShift provided CAs and service-ca, which is defined by SSL_CERT_DIR.
But because of the use of SSL_CERT_DIR to include the OpenShift service-ca, if the service-ca was not available at startup, but became available later, it was not possible to reload the SystemCertPool. Which could cause problems in operator-controller when it tried to connect to catalogd.
The fundamental problem is that there's no way to refresh the SystemCertPool, which will become more and more of an issue as certificate lifetimes decrease.
Using SSL_CERT_DIR allows us to use the CertPoolWatcher to notice changes to the SystemCertPool. This will allow us to restart the process when certificates change (e.g. OpenShift service-ca becomes available).
Changes:
With this, my downstream manifest change should be reverted.
Signed-off-by: Todd Short [email protected]
Assisted-by: Claude Code (alternatives)
Description
Reviewer Checklist