Skip to content

Conversation

tmshort
Copy link
Contributor

@tmshort tmshort commented Aug 28, 2025

This fixes a downstream bug

There was a problem downstream where the OpenShift servivce-ca was not yet available, and due to the way the manifests were set up, the service-ca was considered to be part of the SystemCertPool. The problem is that the SystemCertPool, once initialized, will never reload itself.

We can get into this situation when we use SSL_CERT_DIR and SSL_CERT_FILE to provide OpenShift CAs to be used by containers/image for pulling. These environment variables change the source of the SystemCertPool. The CertPoolWatcher then watches these locations, and tries to update the pool it provides to the HTTPS client connecting to
catalogd. But the SystemCertPool is never updated. (It did not help that there was no explicit CertPoolWatcher for the pull CAs.)

I tried to fix this downstream by removing SSL_CERT_DIR, and specifying the --pull-cas-dir option. This means that containers/image would directly use certificates that we specify, rather than the default location.

But this breaks the use of custom CAs for local image registries.

The containers/image package does not provide a way to manipulate the certificate locations beyond a simple directory setting, and we need to leave that directory setting as the default in downstream because it (i.e. /etc/docker/certs.d) is a host-mounted directory that contains certificates for local image registries. And it is possible to configure a custom CA for a local image registry, so that directory must be included, ALONG with the OpenShift provided CAs and service-ca, which is defined by SSL_CERT_DIR.

But because of the use of SSL_CERT_DIR to include the OpenShift service-ca, if the service-ca was not available at startup, but became available later, it was not possible to reload the SystemCertPool. Which could cause problems in operator-controller when it tried to connect to catalogd.

The fundamental problem is that there's no way to refresh the SystemCertPool, which will become more and more of an issue as certificate lifetimes decrease.

Using SSL_CERT_DIR allows us to use the CertPoolWatcher to notice changes to the SystemCertPool. This will allow us to restart the process when certificates change (e.g. OpenShift service-ca becomes available).

Changes:

  • Update CertPoolWatcher to restart on changes to SSL_CERT_DIR and SSL_CERT_FILE
  • Update CertPoolWatcher to use a Runnable interface, so that it can be added to the manager, and started later, which may improve the changes that the service-ca is ready.
  • Update CertPoolWatcher to not be created when there's nothing to watch.
  • Add CertPoolWatcher to catalogd for pull CAs
  • Add CertPoolWatcher to operator-controller for pull CAs
  • Improve logging

With this, my downstream manifest change should be reverted.

Signed-off-by: Todd Short [email protected]
Assisted-by: Claude Code (alternatives)

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort tmshort requested a review from a team as a code owner August 28, 2025 22:05
Copy link

netlify bot commented Aug 28, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit ff8e932
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/68b211867b11fc000829db1d
😎 Deploy Preview https://deploy-preview-2175--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@openshift-ci openshift-ci bot requested review from dtfranz and perdasilva August 28, 2025 22:05
Copy link

codecov bot commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 68.96552% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.78%. Comparing base (182a0a6) to head (ff8e932).

Files with missing lines Patch % Lines
cmd/operator-controller/main.go 27.77% 10 Missing and 3 partials ⚠️
internal/shared/util/http/certpoolwatcher.go 87.01% 8 Missing and 2 partials ⚠️
cmd/catalogd/main.go 20.00% 6 Missing and 2 partials ⚠️
internal/shared/util/http/certlog.go 44.44% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2175      +/-   ##
==========================================
+ Coverage   72.71%   72.78%   +0.06%     
==========================================
  Files          79       79              
  Lines        7378     7447      +69     
==========================================
+ Hits         5365     5420      +55     
- Misses       1666     1679      +13     
- Partials      347      348       +1     
Flag Coverage Δ
e2e 44.15% <49.13%> (+0.05%) ⬆️
experimental-e2e 56.00% <49.13%> (-0.12%) ⬇️
unit 58.39% <61.20%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 28, 2025
Copy link

openshift-ci bot commented Aug 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: camilamacedo86
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2025
Copy link

openshift-ci bot commented Aug 29, 2025

New changes are detected. LGTM label has been removed.

@tmshort tmshort changed the title 🐛 Restart when SystemCertPool should change WIP: 🐛 Restart when SystemCertPool should change Aug 29, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@tmshort
Copy link
Contributor Author

tmshort commented Aug 29, 2025

I think I need to add certpoolwatch support for the --catalogd-cas-dir option, which is what was causing the trouble in the first place, rather than this restarting of the executable.
EDIT: It's already there, but we're missing certpoolwatch support for --pull-cas-dirs in both programs, so that was done. But also, because the certs are referenced via part of SSL_CERT_DIR, we watch them, but don't load them, as loading is done via the SystemCertPool, which as we know, cannot be updated.

@tmshort tmshort changed the title WIP: 🐛 Restart when SystemCertPool should change OPRUN-4110: 🐛 Restart when SystemCertPool should change Aug 29, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@tmshort tmshort added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@tmshort tmshort changed the title OPRUN-4110: 🐛 Restart when SystemCertPool should change 🐛 OPRUN-4110: Restart when SystemCertPool should change Aug 29, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@tmshort tmshort added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2025
@tmshort tmshort force-pushed the restart-cert branch 2 times, most recently from d88192c to b645ba4 Compare August 29, 2025 16:10
func (cpw *CertPoolWatcher) Done() {
cpw.done <- true
// Change the restart behavior
// The default is to os.Exit(0) when a change to SSL_CERT_DIR or SSL_CERT_FILE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer if the restart func would be a pointer, and default to nil. That way, it would be much more explicit when servicing its use-case, since the change of behavior to do os.Exit(0) would be associated with that area of the code.

At this point I'm aware of one use-case serviced by this functionality, so it wouldn't lead to code duplication. This is evident in the cpw.checkForRestart function, where we only check for conditions associated with the intially-unavailable-so-permanently-un-discoverable CA cert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe a func is considered a pointer, and defaults to nil, so need to change the type, just need to change where it is assigned.
But I did put the default to effectively be os.Exit(0), so that I don't have to check the pointer, but it's only one place.

So, you are suggesting the assignment occur where the CertPoolWatcher is created?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, then the concern & action are co-located, which makes it easier to audit.

@grokspawn
Copy link
Contributor

Confirmed with @tmshort that this comment is no longer valid.

@tmshort
Copy link
Contributor Author

tmshort commented Aug 29, 2025

Confirmed with @tmshort that this comment is no longer valid.

Yeah, that comment was a comment to myself late at night to check something the next day, as I was finishing for the night. I added an EDIT.

This fixes a downstream bug

There was a problem downstream where the OpenShift servivce-ca was not yet available,
and due to the way the manifests were set up, the service-ca was considered to be
part of the SystemCertPool. The problem is that the SystemCertPool, once initialized,
will never reload itself.

We can get into this situation when we use SSL_CERT_DIR and SSL_CERT_FILE to provide
OpenShift CAs to be used by containers/image for pulling. These environment variables
change the source of the SystemCertPool. The CertPoolWatcher then watches these
locations, and tries to update the pool it provides to the HTTPS client connecting to
catalogd. But the SystemCertPool is never updated. (It did not help that there was
no explicit CertPoolWatcher for the pull CAs.)

I tried to fix this downstream by removing SSL_CERT_DIR, and specifying the
`--pull-cas-dir` option. This means that containers/image would directly use
certificates that we specify, rather than the default location.

But this breaks the use of custom CAs for local image registries.

The containers/image package does not provide a way to manipulate the certificate
locations beyond a simple directory setting, and we need to leave that directory
setting as the default in downstream because it (i.e. /etc/docker/certs.d) is a host-
mounted directory that contains certificates for local image registries. And it is
possible to configure a custom CA for a local image registry, so that directory
must be included, ALONG with the OpenShift provided CAs and service-ca, which is
defined by SSL_CERT_DIR.

But because of the use of SSL_CERT_DIR to include the OpenShift service-ca, if the
service-ca was not available at startup, but became available later, it was not
possible to reload the SystemCertPool. Which could cause problems in operator-controller
when it tried to connect to catalogd.

The fundamental problem is that there's no way to refresh the SystemCertPool,
which will become more and more of an issue as certificate lifetimes decrease.

Using SSL_CERT_DIR allows us to use the CertPoolWatcher to notice changes to the
SystemCertPool. This will allow us to restart the process when certificates
change (e.g. OpenShift service-ca becomes available).

Changes:
* Update CertPoolWatcher to restart on changes to SSL_CERT_DIR and SSL_CERT_FILE
* Update CertPoolWatcher to use a Runnable interface, so that it can be added to
the manager, and started later, which may improve the changes that the service-ca
is ready.
* Update CertPoolWatcher to not be created when there's nothing to watch.
* Add CertPoolWatcher to catalogd for pull CAs
* Add CertPoolWatcher to operator-controller for pull CAs
* Improve logging

With this, my downstream manifest change should be reverted.

Signed-off-by: Todd Short <[email protected]>
Assisted-by: Claude Code (alternatives)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants