-
Notifications
You must be signed in to change notification settings - Fork 574
MON-4359: Add Monitoring Capability #2468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
Hello @rexagod! Some important instructions when contributing to openshift/api: |
Blocked by: openshift/api#2468 Signed-off-by: Pranshu Srivastava <[email protected]>
@rexagod: This pull request references MON-4359 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
6d757d5
to
72b018d
Compare
config/v1/types_cluster_version.go
Outdated
|
||
// ClusterVersionCapabilityOptionalMonitoring manages the cluster monitoring stack which is responsible for gathering and | ||
// processing metrics from the in-house and user workloads. The following CRDs are constitute this capability: | ||
// - TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we should not add this capability until we know the full scope of what this capability will and will not do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended.
config/v1/types_cluster_version.go
Outdated
// WARNING: This capability will drop all aforementioned CRDs, and may operational issues in the cluster. | ||
// The only supported use-case for this capability is to reduce the monitoring stack's resource usage by only | ||
// supporting telemetry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What operational issues may happen here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "only supporting telemetry" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added context.
Operations relying the optional components, i.e., Alertmanager, and all UWM components will be disabled. For e.g., the "Alerts" tab under "Observe" will stop showing alerting (firing, pending, etc.).
Telemetry here refers to https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md that teams can use to gather telemetry metrics about their workloads from clusters.
config/v1/types_cluster_version.go
Outdated
ClusterVersionCapabilityCloudControllerManager ClusterVersionCapability = "CloudControllerManager" | ||
|
||
// ClusterVersionCapabilityOptionalMonitoring manages the cluster monitoring stack which is responsible for gathering and | ||
// processing metrics from the in-house and user workloads. The following CRDs are constitute this capability: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it only these CRDs that are removed when this capability is disabled? Are there default monitoring workloads that may no longer be running on the cluster when this is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended. Disabling the capability drops all optional monitoring components (AM and the UWM stack) and any manifests related to them. Do note that CRDs will persist as these are needed by the operator regardless of the capability's state.
config/v1/types_cluster_version.go
Outdated
ClusterVersionCapabilityCloudCredential, | ||
ClusterVersionCapabilityIngress, | ||
ClusterVersionCapabilityCloudControllerManager, | ||
ClusterVersionCapabilityOptionalMonitoring, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should this be included in the 4.19 set if the capability does not exist in 4.19?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped, targeting 4.21 TechPreview now.
config/v1/types_cluster_version.go
Outdated
ClusterVersionCapabilityCloudCredential, | ||
ClusterVersionCapabilityIngress, | ||
ClusterVersionCapabilityCloudControllerManager, | ||
ClusterVersionCapabilityOptionalMonitoring, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should this be included in the 4.20 capability list if this capability does not exist in 4.20?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped, targeting 4.21 TechPreview now.
/assign |
@everettraven Apologies! It seems there was some development before I got a chance to update this after marking it ready. I'll make changes to implicitly enable the capability for TechPreview only in 4.21. |
Add an "OptionalMonitoring" capability to support optional monitoring. Signed-off-by: Pranshu Srivastava <[email protected]>
72b018d
to
051d592
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@everettraven I'm probably missing something but I expected to find something similar to the feature-gates workflow (annotations/tags and such) to enable this just for TechPreview. Could you please drop some pointers on how I can achieve that? Thanks! |
@rexagod: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@rexagod As far as I am aware, the process for tech preview capabilities is the same as adding any new enum value under TechPreviewNoUpgrade, meaning:
I'm relatively new to capabilities for OpenShift so I read up on the original intention of them in https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md With my current understanding of how capabilities work it should be an all-or-nothing type configuration. It seems like an anti-pattern to have a partial capability. I'd like to see answered in the EP you write up:
Thanks! Edit to clarify: Re-reading the description of the capability, I'm not sure a capability is what you are actually looking for here. This seems more like an option that should be configured on and managed by the CMO itself. |
👋🏼 I'm a bit confused about the EP and feature-gate part, since https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md#how-to-implement-a-new-capability doesn't mention those?
I've tried to answer this in the godoc for the capability but essentially this allows us to tune down the resource consumption of the in-cluster monitoring stack by not deploying optional components.
We chose to go with capabilities to support configuring these at install time for clusters that are only concerned with forwarding telemetry metrics.
The optional components in question are Alertmanager and all UserWorkloadMonitoring components managed by CMO. The latter is opt-out by default, so no dependencies there. For AM, as I mentioned earlier, this could impact any areas that expect an AM presence 100% of the time (so far, monitoring-plugin, which we'll fix down the line to hide the "Alerts" tab when AM is absent). |
This document looks like it hasn't been updated in the last couple years. Our processes have changed throughout this time such that any API change, especially to a v1 API, must go behind a feature gate in TechPreviewNoUpgrade. Creating a feature gate requires an OpenShift enhancement proposal. This is meant to ensure that we:
This is all meant to ensure we are maintaining a quality bar for changes that go into the product.
I appreciate you taking the time to answer some of this in the GoDoc. I'm still not quite sure I have an understanding from reading that why some pieces of the monitoring stack we deploy by default is considered optional and why others are not.
My understanding of capabilities is that they are all-or-nothing type mechanisms to tell the ClusterVersionOperator whether or not to stamp $thing out onto the cluster, not a way to tell cluster operators what they should and should not be doing. I don't think this partial use case is one that the "capability" concept was intended to support. I'm not sure off the top of my head if there is another way to achieve what you are looking for at install time, but maybe @JoelSpeed would be more familiar with the options you have here. I've got a few questions here that having an EP would be helpful to field these discussion:
|
// * ThanosRuler User Workload Monitoring | ||
// | ||
// NOTE: The only supported use-case for this capability is to reduce the monitoring stack's resource usage by only | ||
// supporting telemetry (see https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md). Turning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather link to the official docs: https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/support/remote-health-monitoring-with-connected-clusters
// ClusterVersionCapabilityOptionalMonitoring manages the enablement of the optional components of the in-cluster monitoring stack. | ||
// The optional components are: | ||
// * Alertmanager | ||
// * Alermanager User Workload Monitoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we say "All user-defined monitoring components"?
Add a "Monitoring" capability to support optional monitoring.