MON-4361,MON-4380: Optional Monitoring Capability #2675
base: main
Conversation
@rexagod: This pull request references MON-4361 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rexagod. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from c4db7d8 to a322ff8.
It might be good to wait for #2649 since it's migrating all the dashboards to static assets.
On the jsonnet implementation side, I wonder if it wouldn't be easier to read/maintain if we inject the annotation into each component that needs it.
E.g. here for components for which all resources are OptionalMonitoring
cluster-monitoring-operator/jsonnet/main.jsonnet
Lines 520 to 539 in ea9a533
{ ['alertmanager/' + name]: inCluster.alertmanager[name] for name in std.objectFields(inCluster.alertmanager) } +
{ ['alertmanager-user-workload/' + name]: userWorkload.alertmanager[name] for name in std.objectFields(userWorkload.alertmanager) } +
{ ['cluster-monitoring-operator/' + name]: inCluster.clusterMonitoringOperator[name] for name in std.objectFields(inCluster.clusterMonitoringOperator) } +
{ ['dashboards/' + name]: inCluster.dashboards[name] for name in std.objectFields(inCluster.dashboards) } +
{ ['kube-state-metrics/' + name]: inCluster.kubeStateMetrics[name] for name in std.objectFields(inCluster.kubeStateMetrics) } +
{ ['node-exporter/' + name]: inCluster.nodeExporter[name] for name in std.objectFields(inCluster.nodeExporter) } +
{ ['openshift-state-metrics/' + name]: inCluster.openshiftStateMetrics[name] for name in std.objectFields(inCluster.openshiftStateMetrics) } +
{ ['prometheus-k8s/' + name]: inCluster.prometheus[name] for name in std.objectFields(inCluster.prometheus) } +
{ ['admission-webhook/' + name]: inCluster.admissionWebhook[name] for name in std.objectFields(inCluster.admissionWebhook) } +
{ ['prometheus-operator/' + name]: inCluster.prometheusOperator[name] for name in std.objectFields(inCluster.prometheusOperator) } +
{ ['prometheus-operator-user-workload/' + name]: userWorkload.prometheusOperator[name] for name in std.objectFields(userWorkload.prometheusOperator) } +
{ ['prometheus-user-workload/' + name]: userWorkload.prometheus[name] for name in std.objectFields(userWorkload.prometheus) } +
{ ['metrics-server/' + name]: inCluster.metricsServer[name] for name in std.objectFields(inCluster.metricsServer) } +
// needs to be removed once remote-write is allowed for sending telemetry
{ ['telemeter-client/' + name]: inCluster.telemeterClient[name] for name in std.objectFields(inCluster.telemeterClient) } +
{ ['monitoring-plugin/' + name]: inCluster.monitoringPlugin[name] for name in std.objectFields(inCluster.monitoringPlugin) } +
{ ['thanos-querier/' + name]: inCluster.thanosQuerier[name] for name in std.objectFields(inCluster.thanosQuerier) } +
{ ['thanos-ruler/' + name]: inCluster.thanosRuler[name] for name in std.objectFields(inCluster.thanosRuler) } +
{ ['control-plane/' + name]: inCluster.controlPlane[name] for name in std.objectFields(inCluster.controlPlane) } +
{ ['manifests/' + name]: inCluster.manifests[name] for name in std.objectFields(inCluster.manifests) } +
Or at the level of the jsonnet component file in case it's per resource.
CHANGELOG.md
Outdated
- `KubePdbNotEnoughHealthyPods`
- `KubeNodePressure`
- `KubeNodeEviction`
- []() Allow cluster-admins to opt-into optional monitoring using the `OptionalMonitoring` capability.
I realize that adding the annotation to the manifests under the assets/
directory will have no direct effect since there's no logic in CMO to deploy these resources conditionally, right?
I've raised a PR for that: https://github.com/openshift/cluster-monitoring-operator/pull/2688/files.
Thanks!
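To illustrate the idea being discussed, here is a hypothetical, minimal sketch of conditional deployment keyed off the capability annotation; the function and field names are illustrative only and are not taken from #2688.

```go
package client

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const capabilityAnnotation = "capability.openshift.io/name"

// shouldDeploy reports whether a manifest should be applied, given the set of
// capabilities currently enabled on the cluster. Resources that do not carry
// the capability annotation are always deployed.
func shouldDeploy(obj metav1.Object, enabledCapabilities map[string]struct{}) bool {
	capName, ok := obj.GetAnnotations()[capabilityAnnotation]
	if !ok || capName == "" {
		return true
	}
	_, enabled := enabledCapabilities[capName]
	return enabled
}
```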
Force-pushed from 461f6b9 to cbdd203.
Reverted the
kind: ValidatingWebhookConfiguration
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
If the service is optional (*), shouldn't we apply the annotation to all admission-webhook resources?
(*) there could be an argument that we still want the admission webhook for PrometheusRule resources because of telemetry?
Yes, that was my understanding as well, so I limited this to AM strictly.
Ah I didn't realize that this webhook configuration was specifically for AlertmanagerConfig resources. I would still recommend that we document the rationale for not having all admission-webhook resources marked as optional (not sure where it should happen though).
Added a description annotation on the object.
Not directly related to this change, but if the console is disabled, wouldn't it be logical to avoid deploying the monitoring plugin resources?
True, I've moved the plugin to be independent of the OptionalMonitoring capability and instead made it dependent on the Console one, which is in line with its task's behavior. PTAL at commit 55d6da0.
annotations:
  api-approved.openshift.io: https://github.com/openshift/api/pull/1406
  api.openshift.io/merged-by-featuregates: "true"
  capability.openshift.io/name: OptionalMonitoring
not sure that CMO will start if the CRDs aren't present.
Excluded.
kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Same here: IIRC the Prometheus operator will (at a minimum) log errors if the CRDs aren't installed.
Excluded.
Metric rules and metrics exporters have not been opted in, to keep the telemetry rules functioning. Optional components include:
* Alertmanager
* AlertmanagerUWM
* ClusterMonitoringOperatorDeps (partially, for AM)
* MonitoringPlugin
* PrometheusOperator (partially, for AM)
* PrometheusOperatorUWM
* ThanosRuler

Signed-off-by: Pranshu Srivastava <[email protected]>
Drop `monitoring-plugin` from optional components as its deployment should only rely on the `Console` capability. This is in line with its corresponding task's behavior as well. Also refactored the `jsonnet` code a bit. Signed-off-by: Pranshu Srivastava <[email protected]>
Enabling the `OptionalMonitoring` capability translates to enabling all optional monitoring components under CMO. Note that since capabilities cannot be disabled once enabled, cleanup for optional monitoring resources is not necessary. To clarify further, there are two possible paths at install time:
* capability is disabled -> enabled (no need to clean up)
* capability is enabled -/> (cannot be disabled) (no need to clean up)

Signed-off-by: Pranshu Srivastava <[email protected]>
x-posting from #2688:
I believe that would require us to bake that behavior into several places. I also think a per-component exclusion at component initialization lets us dodge the need for a cache-based approach.

EDIT: I've refactored the patch to not modify the tasks' orchestration but the tasks themselves, as I realised that we don't actually need to think about cleaning these up; quoting the third commit:
@rexagod: This pull request references MON-4361 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set. This pull request references MON-4380 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Excluding CRDs from optional monitoring, as their absence can cause CMO and PO to crash or, at the very least, throw errors. Signed-off-by: Pranshu Srivastava <[email protected]>
pkg/client/client.go
Outdated
}

func (c *Client) HasOptionalMonitoringCapability(ctx context.Context) (bool, error) {
	return c.HasClusterCapability(ctx, "")
/hold
Needs openshift/api#2468
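For reference, a minimal sketch of how such a capability check could be implemented against the ClusterVersion status; the `osConfClient` field name is an assumption for illustration, and the actual capability constant depends on openshift/api#2468.

```go
package client

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	openshiftconfigclientset "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Client struct {
	// osConfClient is a placeholder name for the OpenShift config clientset
	// that the real Client would already carry.
	osConfClient openshiftconfigclientset.Interface
}

// HasClusterCapability returns true if the given capability appears in the
// cluster's enabled capabilities.
func (c *Client) HasClusterCapability(ctx context.Context, capability configv1.ClusterVersionCapability) (bool, error) {
	cv, err := c.osConfClient.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, enabled := range cv.Status.Capabilities.EnabledCapabilities {
		if enabled == capability {
			return true, nil
		}
	}
	return false, nil
}
```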
},
},
},
local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
local addAnnotationToChildren(parent, key, value) =
@@ -0,0 +1,31 @@
{
local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
(nit) o (for object) instead of parent?
local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
local addAnnotationToChild(o, key, value) =
local annotationKeyCapability = 'capability.openshift.io/name',
local annotationValueConsoleCapability = 'Console',
local annotationValueOptionalMonitoringCapability = 'OptionalMonitoring',
consoleForObject(o): addAnnotationToChild(o, annotationKeyCapability, annotationValueConsoleCapability),
(suggestion) can we add a small comment before each function to describe what it does?
pkg/tasks/alertmanager.go
Outdated
func (t *AlertmanagerTask) Run(ctx context.Context) error {
	if t.config.ClusterMonitoringConfiguration.AlertmanagerMainConfig.IsEnabled() {
		optionalMonitoringEnabled, err := t.client.HasOptionalMonitoringCapability(ctx)
Can we have the operator retrieve the information about monitoring capability once before reconciling and provide the information to each task runner?
In the same vein, we could create a small helper which would wrap a "Task" and run it only if the monitoring capability is enabled (e.g. useful for "Task"s for which all managed resources are dependent on the capability).
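A minimal sketch of that helper, assuming the existing task contract is a single Run(ctx) error method; the names below are illustrative only. The capability state would be resolved once per reconcile and injected, rather than each task querying the API server:

```go
package tasks

import "context"

// Task is assumed to be the existing task contract used by the operator's
// task runner.
type Task interface {
	Run(ctx context.Context) error
}

// NewCapabilityGatedTask wraps a task so that it only runs when the
// OptionalMonitoring capability is enabled.
func NewCapabilityGatedTask(t Task, optionalMonitoringEnabled bool) Task {
	return &capabilityGatedTask{task: t, enabled: optionalMonitoringEnabled}
}

type capabilityGatedTask struct {
	task    Task
	enabled bool
}

func (t *capabilityGatedTask) Run(ctx context.Context) error {
	// Skip the wrapped task entirely when the capability is disabled; this is
	// only suitable for tasks whose managed resources all depend on it.
	if !t.enabled {
		return nil
	}
	return t.task.Run(ctx)
}
```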
jsonnet/versions.yaml
Outdated
nodeExporter: 1.9.1
promLabelProxy: 0.12.1
prometheus: 3.5.0
prometheus: 3.6.0
(nit) don't bother updating the versions.yaml file in this PR.
Signed-off-by: Pranshu Srivastava <[email protected]>
kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Nit: OptionalMonitoring seems like a fairly redundant name for an optional capability? Is there no more specific/descriptive name we can use for the portion of the monitoring component that's covered by this capability name? If the naming has already been hashed out in an enhancement or something, just point me at that discussion. I'm not trying to re-open something that's already settled.
OptionalMonitoring indicates a capability that targets the optional components of the monitoring stack. Internally, we've used the same terminology for all components that are not required for telemetry. In retrospect, this seemed better than, say, NonTelemetryComponents, but open to ideas.
(I followed these steps to go about implementing this capability, so I didn't create an EP, but PLMK if that's necessary)
Signed-off-by: Pranshu Srivastava <[email protected]>
I've been trying to spin up a cluster for the better part of today with no luck, so I still need to test out the latest commit(s) just to be sure.
@rexagod: The following tests failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Missed this part, on it.
Metric rules and metrics exporters have not been opted in, to keep the telemetry rules functioning.