rexagod
Member

@rexagod rexagod commented Sep 3, 2025

Add a "Monitoring" capability to support optional monitoring.

Contributor

openshift-ci bot commented Sep 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2025
Contributor

openshift-ci bot commented Sep 3, 2025

Hello @rexagod! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 3, 2025
rexagod added a commit to rexagod/cluster-version-operator that referenced this pull request Sep 3, 2025
@rexagod rexagod changed the title MON-X: Add Monitoring Capability MON-4359: Add Monitoring Capability Sep 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 3, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 3, 2025

@rexagod: This pull request references MON-4359 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Add a "Monitoring" capability to support optional monitoring.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rexagod rexagod force-pushed the opt-monitoring-capability branch 3 times, most recently from 6d757d5 to 72b018d Compare September 29, 2025 02:26
@rexagod rexagod marked this pull request as ready for review September 29, 2025 02:27
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2025

// ClusterVersionCapabilityOptionalMonitoring manages the cluster monitoring stack which is responsible for gathering and
// processing metrics from the in-house and user workloads. The following CRDs constitute this capability:
// - TODO
Contributor

Presumably we should not add this capability until we know the full scope of what this capability will and will not do?

Member Author

Amended.

Comment on lines 407 to 409
// WARNING: This capability will drop all aforementioned CRDs, and may cause operational issues in the cluster.
// The only supported use-case for this capability is to reduce the monitoring stack's resource usage by only
// supporting telemetry.
Contributor

What operational issues may happen here?

Contributor

What does "only supporting telemetry" mean here?

Member Author

Added context.

Operations relying on the optional components, i.e., Alertmanager and all UWM components, will be disabled. For example, the "Alerts" tab under "Observe" will stop showing alerts (firing, pending, etc.).

Telemetry here refers to https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md, which teams can use to gather telemetry metrics about their workloads from clusters.

ClusterVersionCapabilityCloudControllerManager ClusterVersionCapability = "CloudControllerManager"

// ClusterVersionCapabilityOptionalMonitoring manages the cluster monitoring stack which is responsible for gathering and
// processing metrics from the in-house and user workloads. The following CRDs constitute this capability:
Contributor

Is it only these CRDs that are removed when this capability is disabled? Are there default monitoring workloads that may no longer be running on the cluster when this is disabled?

Member Author

Amended. Disabling the capability drops all optional monitoring components (AM and the UWM stack) and any manifests related to them. Do note that CRDs will persist as these are needed by the operator regardless of the capability's state.
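
As a rough illustration of the behavior described in this reply, the sketch below models an operator that always deploys required components and deploys optional ones only when the capability is enabled. The `component` type, the `toDeploy` helper, and the component names are placeholders invented for this example; they are not CMO's actual code or manifests, and the sketch does not model the CRDs that persist regardless of the capability's state.

```go
package main

import "fmt"

// component is a hypothetical description of one piece of the monitoring
// stack; the names below are placeholders, not CMO's real manifests.
type component struct {
	name     string
	optional bool
}

// stack lists one assumed required component and the two optional groups
// named in this thread (Alertmanager and the UWM stack).
var stack = []component{
	{name: "prometheus", optional: false},
	{name: "alertmanager", optional: true},
	{name: "user-workload-monitoring", optional: true},
}

// toDeploy returns the components that should run: required ones always,
// optional ones only when the capability is enabled.
func toDeploy(capabilityEnabled bool) []string {
	var names []string
	for _, c := range stack {
		if !c.optional || capabilityEnabled {
			names = append(names, c.name)
		}
	}
	return names
}

func main() {
	fmt.Println(toDeploy(false)) // [prometheus]
	fmt.Println(toDeploy(true))  // [prometheus alertmanager user-workload-monitoring]
}
```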

ClusterVersionCapabilityCloudCredential,
ClusterVersionCapabilityIngress,
ClusterVersionCapabilityCloudControllerManager,
ClusterVersionCapabilityOptionalMonitoring,
Contributor

Why should this be included in the 4.19 set if the capability does not exist in 4.19?

Member Author

Dropped, targeting 4.21 TechPreview now.

ClusterVersionCapabilityCloudCredential,
ClusterVersionCapabilityIngress,
ClusterVersionCapabilityCloudControllerManager,
ClusterVersionCapabilityOptionalMonitoring,
Contributor

Why should this be included in the 4.20 capability list if this capability does not exist in 4.20?

Member Author

Dropped, targeting 4.21 TechPreview now.
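
To make the reviewer's point about versioned sets concrete, here is a minimal sketch using local stand-in types rather than the real openshift/api definitions. The set names `SetV4_20` and `SetV4_21` and the set contents are assumptions for illustration only; the idea is simply that a capability introduced in a later release should not appear in the set for an earlier one.

```go
package main

import "fmt"

// Local stand-ins for the API types; names echo the snippets quoted in this
// review, but the set contents below are illustrative only.
type Capability string
type CapabilitySet string

const (
	CapabilityIngress            Capability = "Ingress"
	CapabilityOptionalMonitoring Capability = "OptionalMonitoring"

	SetV4_20 CapabilitySet = "v4_20" // hypothetical set name
	SetV4_21 CapabilitySet = "v4_21" // hypothetical set name
)

// capabilitySets maps each versioned set to the capabilities that exist in
// that release. A capability introduced in 4.21 is absent from older sets.
var capabilitySets = map[CapabilitySet][]Capability{
	SetV4_20: {CapabilityIngress},
	SetV4_21: {CapabilityIngress, CapabilityOptionalMonitoring},
}

// inSet reports whether capability c is a member of the named set.
func inSet(set CapabilitySet, c Capability) bool {
	for _, member := range capabilitySets[set] {
		if member == c {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(inSet(SetV4_20, CapabilityOptionalMonitoring)) // false
	fmt.Println(inSet(SetV4_21, CapabilityOptionalMonitoring)) // true
}
```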

@everettraven
Contributor

/assign

@rexagod
Member Author

rexagod commented Oct 12, 2025

@everettraven Apologies! It seems there was some development before I got a chance to update this after marking it ready.

I'll make changes to implicitly enable the capability for TechPreview only in 4.21.

@rexagod rexagod marked this pull request as draft October 12, 2025 20:28
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2025
Add an "OptionalMonitoring" capability to support optional monitoring.

Signed-off-by: Pranshu Srivastava <[email protected]>
@rexagod rexagod force-pushed the opt-monitoring-capability branch from 72b018d to 051d592 Compare October 12, 2025 22:29
@rexagod rexagod marked this pull request as ready for review October 12, 2025 22:29
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 12, 2025
Contributor

openshift-ci bot commented Oct 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from everettraven. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from everettraven October 12, 2025 22:30
@rexagod
Member Author

rexagod commented Oct 12, 2025

@everettraven I'm probably missing something but I expected to find something similar to the feature-gates workflow (annotations/tags and such) to enable this just for TechPreview. Could you please drop some pointers on how I can achieve that? Thanks!

Contributor

openshift-ci bot commented Oct 12, 2025

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@everettraven
Contributor

everettraven commented Oct 13, 2025

@rexagod As far as I am aware, the process for tech preview capabilities is the same as adding any new enum value under TechPreviewNoUpgrade, meaning:

  • An OpenShift Enhancement Proposal
  • A new feature gate
  • Adding the new enum using a feature-gate aware enum marker
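
The effect of a feature-gate aware enum can be sketched as follows. This is not the marker syntax or generated validation used by openshift/api; it only models the outcome, where an enum value is accepted only when its gate is enabled. The gate name `OptionalMonitoringCapability` and the `gatedValue` and `validate` helpers are invented for this example.

```go
package main

import "fmt"

// gatedValue pairs an enum value with the feature gate that must be enabled
// for the value to be accepted. An empty gate means "always allowed".
type gatedValue struct {
	value string
	gate  string
}

// capabilityEnum is a hypothetical enum definition; the gate name
// "OptionalMonitoringCapability" is invented for this sketch.
var capabilityEnum = []gatedValue{
	{value: "Ingress"},
	{value: "CloudControllerManager"},
	{value: "OptionalMonitoring", gate: "OptionalMonitoringCapability"},
}

// validate returns an error unless v is an enum member whose gate (if any)
// is enabled in the given gate set.
func validate(v string, enabledGates map[string]bool) error {
	for _, e := range capabilityEnum {
		if e.value != v {
			continue
		}
		if e.gate == "" || enabledGates[e.gate] {
			return nil
		}
		return fmt.Errorf("%q requires feature gate %q", v, e.gate)
	}
	return fmt.Errorf("%q is not a known capability", v)
}

func main() {
	defaultGates := map[string]bool{}
	techPreview := map[string]bool{"OptionalMonitoringCapability": true}

	fmt.Println(validate("OptionalMonitoring", defaultGates) == nil) // false
	fmt.Println(validate("OptionalMonitoring", techPreview) == nil)  // true
}
```

With this shape, a cluster on the default feature set rejects the new value, while a cluster with the gate enabled accepts it.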

I'm relatively new to capabilities for OpenShift so I read up on the original intention of them in https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md

With my current understanding of how capabilities work it should be an all-or-nothing type configuration. It seems like an anti-pattern to have a partial capability.

I'd like to see the following answered in the EP you write up:

  • Why your cluster operator cannot be responsible for removing optional monitoring configurations
  • What is and is not included in this capability and why it is okay to exclude some monitoring components and not others
  • What other components may have a dependency on these now optional components

Thanks!

Edit to clarify: Re-reading the description of the capability, I'm not sure a capability is what you are actually looking for here. This seems more like an option that should be configured on and managed by the CMO itself.

@rexagod
Member Author

rexagod commented Oct 13, 2025

👋🏼 I'm a bit confused about the EP and feature-gate part, since https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md#how-to-implement-a-new-capability doesn't mention those?

> What is and is not included in this capability and why it is okay to exclude some monitoring components and not others

I've tried to answer this in the godoc for the capability but essentially this allows us to tune down the resource consumption of the in-cluster monitoring stack by not deploying optional components.

> Why your cluster operator cannot be responsible for removing optional monitoring configurations
> Re-reading the description of the capability, I'm not sure a capability is what you are actually looking for here. This seems more like an option that should be configured on and managed by the CMO itself.

We chose to go with capabilities to support configuring these at install time for clusters that are only concerned with forwarding telemetry metrics.

> What other components may have a dependency on these now optional components

The optional components in question are Alertmanager and all UserWorkloadMonitoring components managed by CMO. The latter is disabled by default (opt-in), so no dependencies there. For AM, as I mentioned earlier, this could impact any area that expects AM to be present 100% of the time (so far, monitoring-plugin, which we'll fix down the line to hide the "Alerts" tab when AM is absent).

@everettraven
Contributor

> I'm a bit confused about the EP and feature-gate part, since https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md#how-to-implement-a-new-capability doesn't mention those?

This document looks like it hasn't been updated in the last couple years. Our processes have changed throughout this time such that any API change, especially to a v1 API, must go behind a feature gate in TechPreviewNoUpgrade. Creating a feature gate requires an OpenShift enhancement proposal. This is meant to ensure that we:

  • Don't inadvertently ship something as GA without a backing implementation
  • Don't GA something without appropriate regression testing

This is all meant to ensure we are maintaining a quality bar for changes that go into the product.

> I've tried to answer this in the godoc for the capability but essentially this allows us to tune down the resource consumption of the in-cluster monitoring stack by not deploying optional components.

I appreciate you taking the time to answer some of this in the GoDoc. From reading it, I'm still not quite sure I understand why some pieces of the monitoring stack we deploy by default are considered optional and why others are not.

> We chose to go with capabilities to support configuring these at install time for clusters that are only concerned with forwarding telemetry metrics.

My understanding of capabilities is that they are all-or-nothing type mechanisms to tell the ClusterVersionOperator whether or not to stamp $thing out onto the cluster, not a way to tell cluster operators what they should and should not be doing. I don't think this partial use case is one that the "capability" concept was intended to support.
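
The all-or-nothing model described here can be sketched as a manifest filter: a manifest tagged with a capability is either applied or skipped as a whole. This is a simplified illustration, not the ClusterVersionOperator's implementation. The `manifest` type and `filterManifests` helper are invented, the manifest names are placeholders, and the annotation key is assumed to be `capability.openshift.io/name` carrying a single capability name.

```go
package main

import "fmt"

// capabilityAnnotation is the annotation key assumed in this sketch to tag a
// manifest with the capability it belongs to.
const capabilityAnnotation = "capability.openshift.io/name"

// manifest is a hypothetical, minimal view of a release payload manifest.
type manifest struct {
	name        string
	annotations map[string]string
}

// filterManifests keeps untagged manifests and manifests whose capability is
// enabled; everything tagged with a disabled capability is dropped whole.
func filterManifests(all []manifest, enabled map[string]bool) []manifest {
	var kept []manifest
	for _, m := range all {
		capName, tagged := m.annotations[capabilityAnnotation]
		if !tagged || enabled[capName] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	payload := []manifest{
		{name: "core-operator"}, // untagged: always applied
		{name: "optional-monitoring-thing", annotations: map[string]string{
			capabilityAnnotation: "OptionalMonitoring",
		}},
	}

	for _, m := range filterManifests(payload, map[string]bool{}) {
		fmt.Println(m.name) // prints only "core-operator"
	}
}
```

Note that the filter decides only whether a manifest is stamped onto the cluster; it says nothing about what an already-running operator should or should not deploy, which is the mismatch raised above.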

I'm not sure off the top of my head if there is another way to achieve what you are looking for at install time, but maybe @JoelSpeed would be more familiar with the options you have here.

I've got a few questions here, and having an EP would be helpful for fielding these discussions:

  • Why does a user need to configure this at install time?
  • What is the impact to an end user if they can only configure this option as a "day 2" operation?
  • Should users be able to enable/disable the optional pieces of the monitoring stack at will?

// * ThanosRuler User Workload Monitoring
//
// NOTE: The only supported use-case for this capability is to reduce the monitoring stack's resource usage by only
// supporting telemetry (see https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md). Turning

// ClusterVersionCapabilityOptionalMonitoring manages the enablement of the optional components of the in-cluster monitoring stack.
// The optional components are:
// * Alertmanager
// * Alertmanager User Workload Monitoring

should we say "All user-defined monitoring components"?
