operator: multicluster end-to-end observability — raft metrics, reconcile-health, StretchCluster member status, PrometheusRule, dashboard #1509

Open
hidalgopl wants to merge 8 commits into main from pb/multicluster-raft-metrics

Conversation

@hidalgopl
Contributor

@hidalgopl hidalgopl commented May 11, 2026

Ships first-class observability for the multicluster operator. One PR, three metric families plus chart/dashboard/docs.

What's in

1. Multicluster raft metrics (operator_multicluster_raft_*)

The raft layer powering cross-cluster leader election was operationally invisible — only structured logs and a peer.dropCount atomic surfaced via a periodic logger goroutine. Diagnosing flapping leaders, slow peers, or chronic drops meant eyeballing per-pod logs across all peers. The new metrics are registered with controller-runtime's metrics registry, so they are exposed on the operator's existing /metrics endpoint:

Push-based (incremented at the event site; a sketch follows this list):

  • leader_changes_total (Counter)
  • messages_sent_total{msg_type, peer} / messages_received_total{msg_type, peer} (msg_type is a closed vocabulary of ~10 raft message types)
  • send_errors_total{peer, error_type} (error_type bucketed into timeout / canceled / unavailable / auth / marshal / other)
  • messages_dropped_total{peer} — replaces the deleted runDropLogger goroutine + peer.dropCount atomic
  • send_duration_seconds{peer, result} — histogram with cross-region buckets (1ms..2.5s)
  • inflight_rpcs{peer} / peer_reachable{peer} (Gauges)
  • unreachable_reports_total{peer} / snapshots_sent_total{peer} / snapshot_send_errors_total{peer}
  • follower_match_lag_entries{peer} — leader-only
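
A minimal sketch of the push-based side, assuming hypothetical variable and helper names (raftMessagesSent, observeSend); the real definitions, label sets, and registry registration live in pkg/multicluster/leaderelection/metrics.go:

```go
package leaderelection

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	raftMessagesSent = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "operator",
		Subsystem: "multicluster_raft",
		Name:      "messages_sent_total",
		Help:      "Raft messages sent, by message type and destination peer.",
	}, []string{"msg_type", "peer"})

	raftSendDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "operator",
		Subsystem: "multicluster_raft",
		Name:      "send_duration_seconds",
		Help:      "Raft RPC send latency by peer and result.",
		// Cross-region buckets, roughly 1ms..2.5s.
		Buckets: []float64{0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
	}, []string{"peer", "result"})
)

// observeSend stands in for the event-site instrumentation: push-based metrics are
// incremented exactly where the send happens, not on scrape.
func observeSend(peer, msgType string, dur time.Duration, err error) {
	raftMessagesSent.WithLabelValues(msgType, peer).Inc()
	result := "ok"
	if err != nil {
		result = "error"
	}
	raftSendDuration.WithLabelValues(peer, result).Observe(dur.Seconds())
}
```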

Pull-based via prometheus.Collector (read from transport atomics on scrape, no hot-path writes):

  • term (Gauge)
  • state{state=...} — one series per state with 0/1 value, no separate is_leader gauge
  • send_queue_length{peer} (Gauge)

The collector lives in pkg/multicluster/leaderelection/metrics.go, registered idempotently via RegisterTransport(t) after the transport's atomics are populated. Compile-time assertion (var _ prometheus.Collector = &transportCollector{}) ensures the interface stays implemented.
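
A sketch of the pull-based side under stated assumptions: the transport fields (term, peers, queueLen) and the private registry stand-in are illustrative, while the Describe/Collect shape, the compile-time assertion, and the unregister-before-register idempotency follow the description above.

```go
package leaderelection

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

// Transport is a minimal stand-in; the real transport already holds these values as atomics.
type Transport struct {
	term  atomic.Int64
	peers map[string]*peerState
}

type peerState struct{ queueLen atomic.Int64 }

var (
	// The PR registers against controller-runtime's metrics registry; a private
	// registry stands in here to keep the sketch self-contained.
	registry = prometheus.NewRegistry()

	termDesc  = prometheus.NewDesc("operator_multicluster_raft_term", "Current raft term.", nil, nil)
	queueDesc = prometheus.NewDesc("operator_multicluster_raft_send_queue_length", "Outbound queue length per peer.", []string{"peer"}, nil)
)

type transportCollector struct{ t *Transport }

// Compile-time assertion: a missing or drifted Describe/Collect fails the build.
var _ prometheus.Collector = &transportCollector{}

func (c *transportCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- termDesc
	ch <- queueDesc
}

func (c *transportCollector) Collect(ch chan<- prometheus.Metric) {
	// Read-only on scrape: atomic loads, no hot-path writes.
	ch <- prometheus.MustNewConstMetric(termDesc, prometheus.GaugeValue, float64(c.t.term.Load()))
	for peer, p := range c.t.peers {
		ch <- prometheus.MustNewConstMetric(queueDesc, prometheus.GaugeValue, float64(p.queueLen.Load()), peer)
	}
}

var current prometheus.Collector

// RegisterTransport is idempotent: it unregisters any prior collector first, a no-op in
// production (one transport per process) that keeps test ordering safe.
func RegisterTransport(t *Transport) {
	if current != nil {
		registry.Unregister(current)
	}
	c := &transportCollector{t: t}
	registry.MustRegister(c)
	current = c
}
```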

2. Reconcile-health metrics (operator_controller_*)

Wrapper-emitted automatically for every controller registered through observability.Wrap(reconciler, controller, defaultRequeueTimeout):

  • reconcile_steady_state_total{controller} — Counter, incremented when a reconcile returns "no work to do" — either (Result{}, nil) or (Result{RequeueAfter: defaultRequeueTimeout}, nil) matching the controller's configured periodic-requeue interval. The second shape is required because MulticlusterReconciler always returns RequeueAfter = periodicRequeue via a defer; without the dual-shape predicate the StretchCluster controller would never register as steady.
  • reconcile_last_success_timestamp_seconds{controller} — Gauge, Unix timestamp of the most recent steady-state reconcile. Prometheus computes "seconds since last success" at query time as time() - last_success_timestamp_seconds. Avoids the goroutine bookkeeping an imperative "seconds elapsed" gauge would need.

Wrap is generic over reconcile.TypedReconciler[R] so it covers both ctrl.Reconciler (single-cluster) and the multicluster reconciler (mcreconcile.Request) without duplicating the body.
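
A minimal sketch of the wrapper. The metric variable names, the omitted registration, and the exact isPeriodicRequeue signature are assumptions; the generic signature and the dual-shape steady-state predicate follow the description above.

```go
package observability

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var (
	steadyStateTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "operator_controller_reconcile_steady_state_total",
		Help: "Reconciles that returned with no work to do.",
	}, []string{"controller"})

	lastSuccessTimestamp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "operator_controller_reconcile_last_success_timestamp_seconds",
		Help: "Unix timestamp of the most recent steady-state reconcile.",
	}, []string{"controller"})

	// Exposed as a var so tests can pin the clock.
	nowUnix = func() float64 { return float64(time.Now().Unix()) }
)

// Wrap instruments any typed reconciler; one generic body covers both the
// single-cluster and multicluster request types.
func Wrap[R comparable](inner reconcile.TypedReconciler[R], controller string, defaultRequeueTimeout time.Duration) reconcile.TypedReconciler[R] {
	return reconcile.TypedFunc[R](func(ctx context.Context, req R) (reconcile.Result, error) {
		result, err := inner.Reconcile(ctx, req)
		// Steady state is either a clean zero Result or the controller's own
		// periodic-requeue interval, in both cases with a nil error.
		if err == nil && (result.IsZero() || isPeriodicRequeue(result, defaultRequeueTimeout)) {
			steadyStateTotal.WithLabelValues(controller).Inc()
			lastSuccessTimestamp.WithLabelValues(controller).Set(nowUnix())
		}
		return result, err
	})
}

func isPeriodicRequeue(result reconcile.Result, defaultRequeueTimeout time.Duration) bool {
	// Zero means no periodic requeue is configured, so nothing matches it: a stray
	// Result{Requeue: true} on a non-periodic controller must not count as steady.
	return defaultRequeueTimeout != 0 && result.RequeueAfter == defaultRequeueTimeout
}
```

Any other non-zero RequeueAfter deliberately falls outside the predicate, so a genuine backoff requeue never counts as steady state.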

Three controllers wrap: v2 Redpanda, NodePool, and StretchCluster. There are two setup paths per controller: SetupWithManager (single-cluster binary, cmd/run) and SetupWithMultiClusterManager (multicluster binary, cmd/multicluster). Both paths wrap.

3. StretchCluster member-status metrics (operator_stretchcluster_*)

Per-member gauges, bounded cardinality (stretchcluster, member):

  • member_reachable — 0/1 from the multicluster manager's reachability probe. Local cluster is always 1, recorded under its canonical name via lifecycle.CanonicalClusterName.
  • brokers / brokers_ready — desired and ready broker counts per member, summed across NodePools pointing at that member. brokers - brokers_ready > 0 = partial outage.
  • replication_health{stretchcluster} — 0/1, cluster-wide from the admin API health check reconcileDecommission already runs.
  • spec_drift{stretchcluster, member} — 0/1, whether each member's local StretchCluster.spec matches the operator's view. Set inside checkSpecConsistency.

No new API calls — passive recorder, callers pass values they already have. MulticlusterReconciler is the only consumer; instrumented at three sites where the data already lives.
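
A sketch of that passive-recorder shape, with an assumed helper name (the real helpers share the RecordStretchCluster* prefix in stretch_recorder.go): the caller passes a value it already holds and the recorder only sets a gauge.

```go
package observability

import "github.com/prometheus/client_golang/prometheus"

var stretchMemberReachable = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "operator_stretchcluster_member_reachable",
	Help: "1 if the member cluster answered the reachability probe, else 0.",
}, []string{"stretchcluster", "member"})

// RecordStretchClusterMemberReachable is an assumed helper name. It performs no API
// calls: the reconciler passes a value it already computed, and the recorder only
// flips a gauge.
func RecordStretchClusterMemberReachable(stretchCluster, member string, reachable bool) {
	v := 0.0
	if reachable {
		v = 1
	}
	stretchMemberReachable.WithLabelValues(stretchCluster, member).Set(v)
}
```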

4. PrometheusRule (gated by monitoring.rulesEnabled chart value)

operator/chart/prometheusrule.go (transpiled to _prometheusrule.go.tpl by gotohelm). The new monitoring.rulesEnabled value is a sibling of monitoring.enabled (which gates the ServiceMonitor) and is independent of it, so consumers can opt into rules without the ServiceMonitor. A sketch of the chart source follows the alert list below.

Recording rules: operator:reconcile_rate:5m, operator:reconcile_error_rate:5m, operator:reconcile_steady_state_rate:5m, operator:reconcile_p99_seconds:5m.

Alerts, all severity=warning:

  • OperatorReconcileErrors: operator:reconcile_error_rate:5m > 0.1 for 5m
  • OperatorReconcileRunaway: operator:reconcile_rate:5m > 5 for 5m (the canonical "spinning controller" signal; cross-checks steady_state_total)
  • OperatorReconcileStalled: active in the past hour but reconcile rate == 0 for 10m
  • OperatorWorkerPoolSaturated: active_workers >= max_concurrent_reconciles for 10m
  • StretchClusterMemberUnreachable: 2m
  • StretchClusterBrokerCountSkew: 10m
  • StretchClusterSpecDrift: 5m
  • StretchClusterReplicationUnhealthy: 5m
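
A gotohelm-style sketch of the chart source, assuming a simplified values type and illustrative object names and expressions; the real rule set, expressions, and for: durations are the ones listed above.

```go
package chart

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// values stands in for the chart's real values type.
type values struct {
	Monitoring struct {
		RulesEnabled bool
	}
}

// PrometheusRule sketches the Go source that gotohelm transpiles to _prometheusrule.go.tpl.
func PrometheusRule(v values) *monitoringv1.PrometheusRule {
	if !v.Monitoring.RulesEnabled { // sibling of monitoring.enabled, independent of the ServiceMonitor
		return nil
	}
	return &monitoringv1.PrometheusRule{
		ObjectMeta: metav1.ObjectMeta{Name: "operator-observability"}, // illustrative name
		Spec: monitoringv1.PrometheusRuleSpec{
			Groups: []monitoringv1.RuleGroup{{
				Name: "operator-reconcile-health",
				Rules: []monitoringv1.Rule{
					{
						Record: "operator:reconcile_rate:5m",
						Expr:   intstr.FromString(`sum by (controller) (rate(controller_runtime_reconcile_total[5m]))`),
					},
					{
						// The real alerts also carry the for: windows listed above; omitted here.
						Alert:  "OperatorReconcileRunaway",
						Expr:   intstr.FromString(`operator:reconcile_rate:5m > 5 and operator:reconcile_steady_state_rate:5m == 0`),
						Labels: map[string]string{"severity": "warning"},
					},
				},
			}},
		},
	}
}
```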

5. Comprehensive Grafana dashboard (docs/operator-grafana-dashboard.json)

Single comprehensive dashboard, 5 rows: multicluster raft, StretchCluster member status, reconcile activity, queues & workers, reconcile-health signals.

Leader-only panels (Send latency p99, Follower match-lag entries) use a PromQL and ignoring(...) set intersection to restrict each query to the current leader's perspective. This works in both dev-env (where remote_write adds a vcluster label) and prod (direct scrape with just instance).

6. Single source of truth for metric definitions

Every prometheus.New* call in the operator module now lives in operator/internal/observability/metrics.go (the v1 vectorized Cluster metrics in operator/internal/controller/vectorized/metric_controller.go are out of scope — v1 is legacy and explicitly not touched). Recorder helpers (RecordStretchCluster*) stay in stretch_recorder.go. The raft family stays in pkg/multicluster/leaderelection/metrics.go because that's a different Go module.

7. Documentation (docs/operator-metrics.md)

Canonical inventory of every metric the operator exposes. Cardinality table up front lists every label and its bounded vocabulary. Four groups: controller-runtime built-ins, reconcile-health, resource-state, multicluster raft.

8. Testing

  • Unit tests (wrapper_test.go) cover every record-path branch: Result{} on a controller with defaultRequeueTimeout=0, Result{RequeueAfter: defaultRequeueTimeout} (periodic-steady), Result{RequeueAfter: other} (a real requeue, not steady), errors, and immediate requeue, plus plain passthrough of the inner result.
  • Integration test (integration_test.go): TestIntegrationObservabilityInfiniteReconcile runs a synthetic reconciler inside a real controller-runtime Manager driven by envtest, switching between spinning (RequeueAfter 100ms) and steady state mid-test and asserting the metrics react correctly. Gated by testutil.SkipIfNotIntegration + -tags integration. A sketch of the mode-switching reconciler follows this list.
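
A sketch of that synthetic reconciler, with assumed names; the real one lives in integration_test.go.

```go
package observability_test

import (
	"context"
	"sync/atomic"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// modeSwitchingReconciler is an assumed name: an atomic flag flips it between a
// spinning mode and a steady-state mode, and the observability wrapper around it is
// what the test's assertions read.
type modeSwitchingReconciler struct {
	steady atomic.Bool
}

func (r *modeSwitchingReconciler) Reconcile(ctx context.Context, _ ctrl.Request) (ctrl.Result, error) {
	if r.steady.Load() {
		// Steady state: the wrapper bumps steady_state_total and stamps last_success.
		return ctrl.Result{}, nil
	}
	// Spinning: a tight self-requeue that must never register as steady.
	return ctrl.Result{RequeueAfter: 100 * time.Millisecond}, nil
}
```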

Design notes worth flagging for reviewers

  • leader_id is deliberately not a numeric gauge — sum(leader_id) is meaningless and state{state="leader"} == 1 already identifies the leader on each peer.
  • state modelled as one series per state value (leader|follower|candidate|pre_candidate|unknown) with 0/1, not as a label-string-as-value. sum(state) == 1 invariant; state{state="leader"} == 1 is the standard leader filter.
  • Wrapper's isPeriodicRequeue returns false when defaultRequeueTimeout == 0 so a stray Result{Requeue: true} (RequeueAfter == 0) on a non-periodic controller doesn't accidentally register as periodic-steady. Plain Result{} still counts via the result.IsZero() branch.

@hidalgopl hidalgopl marked this pull request as draft May 11, 2026 08:43
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch 2 times, most recently from fce0d26 to fc0d602 on May 12, 2026 08:46
@hidalgopl hidalgopl changed the base branch from pb/multicluster-operator-debug-bundle to main May 12, 2026 09:28
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch from fc0d602 to 32963a4 on May 13, 2026 09:02
@hidalgopl hidalgopl changed the title multicluster: emit raft metrics for cross-cluster leader election operator: multicluster end-to-end observability — raft metrics, reconcile-health, StretchCluster member status, PrometheusRule, dashboard May 13, 2026
@hidalgopl hidalgopl marked this pull request as ready for review May 13, 2026 11:49
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch 3 times, most recently from 83f91a2 to 55fee90 on May 13, 2026 12:21
hidalgopl and others added 8 commits May 13, 2026 14:38
…ter member status

Three slices of operator observability that flow into a single PrometheusRule and a single Grafana dashboard.

**Multicluster raft metrics.** The raft layer was operationally invisible — only structured logs and a `peer.dropCount` atomic surfaced via a periodic logger goroutine. New `operator_multicluster_raft_*` family registered to controller-runtime's metrics registry: `leader_changes_total`, `messages_{sent,received}_total{msg_type,peer}`, `send_errors_total{peer,error_type}` (closed six-value vocabulary: timeout/canceled/unavailable/auth/marshal/other), `messages_dropped_total{peer}`, `send_duration_seconds{peer,result}` (cross-region buckets 1ms..2.5s), `inflight_rpcs{peer}`, `peer_reachable{peer}`, `unreachable_reports_total{peer}`, `snapshots_sent_total{peer}`, `snapshot_send_errors_total{peer}`, and a leader-only `follower_match_lag_entries{peer}` (reads `node.Status().Progress`; followers keep prior value, federation expectation is "scrape from the leader"). Transport-backed gauges read existing atomics on scrape via `RegisterTransport(t)`: `term`, `state{state="leader|follower|candidate|pre_candidate|unknown"}` (one series per state with 0/1 value, no separate `is_leader` gauge), `send_queue_length{peer}`. `runDropLogger` and the `peer.dropCount` atomic are deleted — `messages_dropped_total{peer}` plus standard alerting covers it. `RegisterTransport` unregisters any prior transport collector before registering; safe no-op in prod (one transport per process), fixes test-ordering in `setupLockTest`.

**Reconcile-health metrics.** Every controller already emits the controller-runtime built-ins but there are no signals tuned for self-triggered loops, falling behind on spec, or non-determinism in spec-rendering. New `operator/internal/observability/` package adds `Wrap[R](inner reconcile.TypedReconciler[R], controller string)` middleware that emits `operator_controller_reconcile_steady_state_total{controller}` (incremented when the inner returned `(Result{}, nil)` — a controller whose `reconcile_total` rate is high but `steady_state` rate is flat is spinning) and `operator_controller_reconcile_requeue_after_seconds{controller}` (histogram of `Result.RequeueAfter`; tight cluster of sub-second values = retry loop). Generic over `reconcile.TypedReconciler[R]` so the same wrapper covers both `ctrl.Reconciler` and the multicluster reconciler. Two passive recorder helpers for per-object signals that need an object reference: `RecordObservedGeneration(controller, kind, gen, obsGen)` → `operator_controller_reconcile_observed_generation_drift` (clamps negative deltas to zero), and `RecordSpecHashChangedWithoutGeneration(controller, kind)` → `operator_controller_reconcile_spec_hash_changed_without_generation_total` (canonical non-determinism signal). Both leave it to the calling controller to decide when to record so the observability layer never duplicates API reads. Three controllers wrap at `SetupWithManager`: v2 Redpanda, v2 NodePool, and the multicluster StretchCluster reconciler. Other controllers (Console, vectorized v1, decommissioners, PVCUnbinder, NodeWatcher) keep their built-ins un-wrapped — they don't manage the resources in scope.

**StretchCluster member-status metrics.** Where `operator_controller_*` describes how the controllers behave, these describe what they're managing. All gauges, bounded label cardinality (`stretchcluster`, `member`): `operator_stretchcluster_member_reachable` (0/1 from the multicluster manager's reachability probe; local cluster always 1, recorded under its canonical name via `lifecycle.CanonicalClusterName` rather than the multicluster-runtime's empty-string sentinel); `operator_stretchcluster_brokers` / `operator_stretchcluster_brokers_ready` (desired and ready broker counts per member, summed across NodePools pointing at that member — gap on a single member = partial outage); `operator_stretchcluster_replication_health{stretchcluster}` (0/1, cluster-wide from the admin API check `reconcileDecommission` already makes, recorded right after the call returns); `operator_stretchcluster_spec_drift{stretchcluster, member}` (0/1, does each member's local StretchCluster.spec match the operator's view; set inside the existing `checkSpecConsistency` routine). No new API calls — passive recorder, callers pass values they already have. `MulticlusterReconciler` is the only consumer; instrumented at three sites where the data already lives (`checkSpecConsistency`, `reconcileDecommission`, and a new `recordBrokerCountMetrics` helper called once per reconcile after `fetchInitialState`). Unreachable members are recorded as `member_reachable=0` but `spec_drift` keeps its prior value because we genuinely don't know.

**Chart artifacts.** `operator/chart/prometheusrule.go` (transpiled to `_prometheusrule.go.tpl` by gotohelm) emits a PrometheusRule with recording rules and two alert groups — reconcile health (`OperatorReconcileErrors`, `OperatorReconcileRunaway`, `OperatorReconcileStalled`, `OperatorWorkerPoolSaturated`, `OperatorObservedGenerationDrift`, `OperatorNonDeterministicSpec`) and StretchCluster (`StretchClusterMemberUnreachable` 2m, `StretchClusterBrokerCountSkew` 10m, `StretchClusterSpecDrift` 5m, `StretchClusterReplicationUnhealthy` 5m). All severity `warning` — indicators that need eyes, not page-now incidents. New `values.monitoring.rulesEnabled` (default `false`) — sibling of `monitoring.enabled` (ServiceMonitor); independent so consumers can opt into rules without the ServiceMonitor. Chart test case `monitoring-rules-enabled` locks the output into the golden file. `docs/operator-grafana-dashboard.json` rewritten as a single comprehensive dashboard: 29 panels across 5 rows covering reconcile health, StretchCluster member status, and multicluster raft. `docs/operator-metrics.md` is the canonical inventory restructured into four groups (controller-runtime built-ins / reconcile-health / resource-state / multicluster raft); cardinality table up top lists every label and its closed vocabulary; explicit "PLANNED (not yet emitted)" subsection lists `self_triggered_total` / `time_since_last_success_seconds` so downstream dashboards don't break later.

**Design notes.** Plain `prometheus/client_golang`, not OTel — matches the existing pattern in `operator/internal/controller/vectorized/metric_controller.go`, `operator/cmd/version/version.go`, `operator/pkg/client/kgo_hooks.go`. `leader_id` is deliberately not a numeric gauge — `sum(leader_id)` is meaningless and `state{state="leader"}` already identifies the leader on each peer. `leader_changes_total` is incremented in the Ready loop on `leader != prevLeader && leader != 0`; scraping from the leader gives the cluster-wide leader-change count. `error_type` bucketing keeps cardinality bounded and each bucket maps to a different on-call story.

Tests cover `msgTypeLabel` and `normaliseRaftState` exhaustively, plus an integration test that brings up a 3-node cluster via the existing `setupLockTest` harness and asserts gauges reflect elected state and heartbeat counters accrue from natural traffic. `operator/internal/observability/wrapper_test.go` covers the steady-state counter, requeue-after histogram, observed-generation drift clamping, and the spec-hash counter. Chart golden file regenerated for `monitoring-rules-enabled`.
…ntegration test

Adds the two reconcile-health metrics that the existing
observability inventory documented as PLANNED / Reserved:

* operator_controller_reconcile_last_success_timestamp_seconds
  (gauge, wrapper-emitted). Set inside the existing
  steady-state branch via a one-line write of the current
  Unix timestamp. Bounded cardinality (controller label only).
  Prometheus computes the user-facing "seconds since last
  success" as time() - this_gauge — no goroutine bookkeeping
  or oldest-unfinished tracker required.

* operator_controller_reconcile_self_triggered_total (counter,
  opt-in via observability.RecordSelfTriggered). Increments
  when a controller has detected that its own write to an
  object will re-enqueue the same reconcile without any other
  observable effect. The wrapper deliberately does not
  increment this — that would require a redundant Get to hash
  before/after every reconcile, breaking the wrapper's
  "passive, no extra reads" design. Controllers opt in from
  their own write helpers where the pre/post-write state is
  already in hand.

The wrapper exposes nowUnix as a var so deterministic tests
can drive the gauge without clock skew.
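
A sketch of what that looks like in a test, with an assumed test name and timestamp; nowUnix here is the package-level var described above.

```go
func TestSteadyStateStampsLastSuccess(t *testing.T) {
	orig := nowUnix
	nowUnix = func() float64 { return 1_700_000_000 } // deterministic clock
	t.Cleanup(func() { nowUnix = orig })
	// ...drive the wrapped reconciler, then assert the gauge reads 1_700_000_000.
}
```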

Dashboard: the existing "Self-triggered reconciles" stat
panel drops its "reserved for future / renders N/A" language
and gains a real description pointing at RecordSelfTriggered.
A new full-width timeseries panel renders time() -
last_success_timestamp_seconds per controller as
seconds-since-last-success — climbing past natural re-queue
intervals means the controller is failing or spinning.

Tests:

* Unit tests cover the new wrapper gauge (advances on steady
  state, frozen on error and on RequeueAfter), and the
  RecordSelfTriggered helper.

* TestIntegrationObservabilityInfiniteReconcile drives the
  wrapper through a real controller-runtime Manager against
  an envtest apiserver. A test reconciler watches ConfigMaps
  and switches between spinning (RequeueAfter 100ms) and
  steady-state via an atomic. Asserts that the spinning phase
  keeps last_success_timestamp_seconds and
  steady_state_total at 0 while the requeue histogram fills,
  then validates recovery once the mode flips.
  testutil.SkipIfNotIntegration gates it behind -tags
  integration.

Docs: operator-metrics.md moves both metrics out of the
"Reserved (currently silent)" section into the live
wrapper / recorder tables. The Reserved section is now gone.

Changie entry under operator-Added-*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First opt-in caller of observability.RecordSelfTriggered. The
MulticlusterReconciler's syncStatus loop pushes the local
StretchCluster.Status to every reachable peer via
Status().Update(). When the remote's existing status is
semantically identical to what we're about to write, the Update
still bumps resourceVersion and re-enqueues the StretchCluster
reconciler on that peer — the canonical infinite-reconcile shape.

apiequality.Semantic.DeepEqual is already imported and used
elsewhere in this file for the spec-drift check; reuse it here
for the status-equality probe. Recorded after a successful
Update so transient write errors don't pollute the counter.

The metric powers the dashboard's "Self-triggered reconciles"
stat panel — sustained non-zero rate is an indicator that the
syncStatus path should diff before writing rather than writing
unconditionally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stogram

The wrapper's previous steady-state predicate only matched
(Result{}, nil) — a clean "no work" return. But the
MulticlusterReconciler always returns RequeueAfter =
periodicRequeue via a defer (the canonical "wake me up
periodically" pattern), so it never entered the steady-state
branch. Result: last_success_timestamp_seconds and
steady_state_total stayed permanently empty for that
controller despite it being healthy.

Wrap now takes a third argument: defaultRequeueTimeout. The
record() branch is now:

  err == nil && (result.IsZero() || isPeriodicRequeue(result))

Both shapes count as steady state. isPeriodicRequeue returns
true when result.RequeueAfter exactly equals
defaultRequeueTimeout (and false when defaultRequeueTimeout
== 0, so a stray Result{Requeue: true} on a non-periodic
controller doesn't accidentally register).

The requeue-after histogram now skips the periodic value —
otherwise the periodic-wake samples would dominate every
bucket and bury the tight-retry-loop signal the histogram
exists to surface.

Three call sites updated to pass the right periodic value:

  * redpanda_controller.go:157   → periodicRequeue
  * nodepool_controller.go:155   → periodicRequeue
  * multicluster_controller.go   → defaultReconcileTimeout

Tests cover all four paths:

  * Result{} on a controller with defaultRequeueTimeout=0 → steady
  * Result{RequeueAfter: defaultRequeueTimeout} → steady
  * Result{RequeueAfter: other} → not steady, observed in histogram
  * Result{RequeueAfter: defaultRequeueTimeout} → NOT observed
    in histogram

Docs and recorder.go godoc updated to reflect the dual-shape
steady-state definition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A critical second pass on the operator_controller_* family
revealed three metrics that were defined and exported but had
zero call sites: passive opt-in helpers waiting for a
controller to wire them, which never happened. Same shape as
the self_triggered_total metric that was removed in an earlier
pass.

Removed metrics (with their helpers, tests, alerts, and
dashboard panels):

  * operator_controller_reconcile_self_triggered_total
    (was wired to a syncStatus deep-equal probe in
    multicluster_controller.go — that wiring is reverted here
    too; the canonical "spinning loop" pattern is already
    detected by rate(reconcile_total) > rate(steady_state_total)
    and the OperatorReconcileRunaway alert).

  * operator_controller_reconcile_observed_generation_drift
    (gauge + RecordObservedGeneration helper; alert
    OperatorObservedGenerationDrift; dashboard panel 21).
    Generation drift is detectable on a per-resource basis
    from status.observedGeneration directly if a future use
    case requires it.

  * operator_controller_reconcile_spec_hash_changed_without_generation_total
    (counter + RecordSpecHashChangedWithoutGeneration helper;
    alert OperatorNonDeterministicSpec; dashboard panel 22).
    A passive opt-in helper with zero call sites is dead code.

Net surface of the operator_controller_* family is now three
wrapper-emitted metrics — no opt-in helpers, no dead code:

  * reconcile_steady_state_total       (counter)
  * reconcile_requeue_after_seconds    (histogram)
  * reconcile_last_success_timestamp_seconds (gauge)

While here:

  * The Wrap call in multicluster_controller.go was passing
    defaultReconcileTimeout (2m, the per-reconcile context
    deadline) instead of periodicRequeue (3m, what the
    reconciler actually returns). That meant the wrapper's
    isPeriodicRequeue predicate never matched and
    last_success_timestamp_seconds stayed empty for the
    StretchCluster controller. Fixed.

  * Dashboard panels 20 and 24 expanded to full width to fill
    the gaps left by deleted panels 21 / 22. Version bumped
    to 6.

  * Docs (operator-metrics.md), changelog entries, and the
    chart's prometheusrule.go all reflect the reduced surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Metric definitions were spread across three files in the operator
module:

  * operator/internal/observability/recorder.go    (3 operator_controller_*)
  * operator/internal/observability/stretch_recorder.go (5 operator_stretchcluster_*)
  * operator/internal/controller/redpanda/metric_controller.go (4 v2 redpanda_*)

To answer "what metrics does this operator expose?" you had to
grep three files. Consolidating: every prometheus.New* call now
lives in operator/internal/observability/metrics.go, grouped
into three sections (reconcile-health, StretchCluster, Redpanda
v2 CR resource-state) with a single init() that registers all
12 metrics. The redpanda metric_controller.go reconciler imports
the exported vars from the observability package instead of
defining its own.

The v2 metric vars are now exported (Redpandas,
RedpandaDesiredNodes, RedpandaReadyNodes,
RedpandaMisconfiguredClusters) instead of package-private.
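
A sketch of the consolidated file's single registration point; the Redpandas var is named in this commit, but the metric name and help text shown are illustrative.

```go
package observability

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Exported so the redpanda metric_controller.go reconciler can set it directly.
var Redpandas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "redpanda_resources", // illustrative name only
	Help: "Redpanda v2 custom resources known to the operator.",
}, nil)

func init() {
	// One registration point for every metric defined in this file, against
	// controller-runtime's registry so everything shares the /metrics endpoint.
	metrics.Registry.MustRegister(
		Redpandas,
		// ...the remaining reconcile-health, StretchCluster, and Redpanda v2 metrics
	)
}
```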

recorder.go is reduced to the package docstring (kept as the
canonical place for the package-level godoc).
stretch_recorder.go is reduced to the four RecordStretchCluster*
helper functions.

Out of scope:

  * v1 (vectorized.redpanda.com Cluster) metrics — that
    controller is legacy and explicitly excluded from
    unrelated changes. Its 4 metrics remain defined next to
    its reconciler.
  * Multicluster raft metrics in pkg/multicluster/leaderelection/
    metrics.go — different Go module; cross-module consolidation
    would either expose operator/internal/observability/ to all
    pkg/ consumers or pull controller-runtime deps into the pkg
    module.

No emitted metric names change. No alerts or dashboards affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reconcile_requeue_after_seconds histogram was a triage-aid
metric distinguishing self-requeue churn from external-event-
driven churn, but in practice the primary spinning detector
(rate(reconcile_total) > 5 while rate(steady_state_total) == 0,
fired by OperatorReconcileRunaway) is sufficient and the
histogram's added diagnostic value rarely justified the
maintenance cost — 11 bucket counters per controller plus a
wrapper-side write per non-zero non-periodic RequeueAfter
return.

The wrapper retains its isPeriodicRequeue predicate because the
steady-state branch still needs it to recognise periodic-wake
returns as steady. The dashboard "Reconcile-health signals"
section now contains a single full-width panel
(time() - last_success_timestamp_seconds).

Net surface of operator_controller_* after this drop:

  * reconcile_steady_state_total       (counter)
  * reconcile_last_success_timestamp_seconds (gauge)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, changie

Five small follow-up changes from a final review pass over the
observability work:

1. operator/internal/controller/redpanda/nodepool_controller.go —
   wrap NodePool with observability.Wrap in SetupWithMultiClusterManager.
   The multicluster binary (cmd/multicluster) calls this path; only the
   single-cluster binary's SetupWithManager was wrapping. Result:
   NodePool emitted controller-runtime built-in metrics in the
   multicluster binary but nothing from the operator_controller_*
   family — its dashboard panels were permanently empty. Found via
   dogfooding the dashboard against the dev env.

2. docs/operator-grafana-dashboard.json — fix two leader-only panel
   queries ("Send latency p99 (per peer, leader's view)" and "Follower
   match-lag entries (leader's view)"). The old form used
   `sum by (le, peer) (...) * on(instance) group_left() (leader==1)`,
   which drops `instance` in the aggregation and then tries to join on
   it. Replaced with `and ignoring(le, peer, result, state)
   (leader==1)` — set-intersection that joins on every shared identity
   label, works in both dev (with the dev-env `vcluster` label) and
   prod (direct scrape with just `instance`).

3. pkg/multicluster/leaderelection/metrics.go — compile-time assertion
   `var _ prometheus.Collector = &transportCollector{}` so a missing
   or signature-drifted Describe / Collect method fails the build
   instead of failing at runtime registration.

4. docs/operator-metrics.md — clean up stale references left over
   from the metric audit (mentions of generation drift /
   non-determinism in the Group 2 intro, the orphaned `kind` label in
   the cardinality table, the opt-in Record* helpers that no longer
   exist, and the dashboard cross-link framing).

5. .changes/unreleased/ — replace three incremental Added entries
   with one consolidated entry covering the whole observability story.
   The previous incremental entries described slices in commit-time
   order and the raft family (the original subject of this PR) had no
   dedicated entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch from 55fee90 to f0c05ba on May 13, 2026 12:39