operator: multicluster end-to-end observability — raft metrics, reconcile-health, StretchCluster member status, PrometheusRule, dashboard #1509

Open
hidalgopl wants to merge 8 commits into main from pb/multicluster-raft-metrics

Conversation

@hidalgopl
Contributor

@hidalgopl hidalgopl commented May 11, 2026

Ships first-class observability for the multicluster operator. One PR, three metric families plus chart/dashboard/docs.

What's in

1. Multicluster raft metrics (operator_multicluster_raft_*)

The raft layer powering cross-cluster leader election was operationally invisible — only structured logs and a peer.dropCount atomic surfaced via a periodic logger goroutine. Diagnosing flapping leaders, slow peers, or chronic drops meant eyeballing per-pod logs across all peers. The new metrics are registered with controller-runtime's metrics registry, so they are exposed on the operator's existing /metrics endpoint:

Push-based (incremented at the event site; a sketch follows this list):

  • leader_changes_total (Counter)
  • messages_sent_total{msg_type, peer} / messages_received_total{msg_type, peer} (msg_type is a closed vocabulary of ~10 raft message types)
  • send_errors_total{peer, error_type} (error_type bucketed into timeout / canceled / unavailable / auth / marshal / other)
  • messages_dropped_total{peer} — replaces the deleted runDropLogger goroutine + peer.dropCount atomic
  • send_duration_seconds{peer, result} — histogram with cross-region buckets (1ms..2.5s)
  • inflight_rpcs{peer} / peer_reachable{peer} (Gauges)
  • unreachable_reports_total{peer} / snapshots_sent_total{peer} / snapshot_send_errors_total{peer}
  • follower_match_lag_entries{peer} — leader-only
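
A minimal sketch of the push-based side, assuming hypothetical variable and helper names (raftMessagesSent, observeSend); the real definitions, label sets, and registry registration live in pkg/multicluster/leaderelection/metrics.go:

```go
package leaderelection

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	raftMessagesSent = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "operator",
		Subsystem: "multicluster_raft",
		Name:      "messages_sent_total",
		Help:      "Raft messages sent, by message type and destination peer.",
	}, []string{"msg_type", "peer"})

	raftSendDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "operator",
		Subsystem: "multicluster_raft",
		Name:      "send_duration_seconds",
		Help:      "Raft RPC send latency by peer and result.",
		// Cross-region buckets, roughly 1ms..2.5s.
		Buckets: []float64{0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
	}, []string{"peer", "result"})
)

// observeSend stands in for the event-site instrumentation: push-based metrics are
// incremented exactly where the send happens, not on scrape.
func observeSend(peer, msgType string, dur time.Duration, err error) {
	raftMessagesSent.WithLabelValues(msgType, peer).Inc()
	result := "ok"
	if err != nil {
		result = "error"
	}
	raftSendDuration.WithLabelValues(peer, result).Observe(dur.Seconds())
}
```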

Pull-based via prometheus.Collector (read from transport atomics on scrape, no hot-path writes):

  • term (Gauge)
  • state{state=...} — one series per state with 0/1 value, no separate is_leader gauge
  • send_queue_length{peer} (Gauge)

The collector lives in pkg/multicluster/leaderelection/metrics.go, registered idempotently via RegisterTransport(t) after the transport's atomics are populated. Compile-time assertion (var _ prometheus.Collector = &transportCollector{}) ensures the interface stays implemented.
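
A sketch of the pull-based side under stated assumptions: the transport fields (term, peers, queueLen) and the private registry stand-in are illustrative, while the Describe/Collect shape, the compile-time assertion, and the unregister-before-register idempotency follow the description above.

```go
package leaderelection

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

// Transport is a minimal stand-in; the real transport already holds these values as atomics.
type Transport struct {
	term  atomic.Int64
	peers map[string]*peerState
}

type peerState struct{ queueLen atomic.Int64 }

var (
	// The PR registers against controller-runtime's metrics registry; a private
	// registry stands in here to keep the sketch self-contained.
	registry = prometheus.NewRegistry()

	termDesc  = prometheus.NewDesc("operator_multicluster_raft_term", "Current raft term.", nil, nil)
	queueDesc = prometheus.NewDesc("operator_multicluster_raft_send_queue_length", "Outbound queue length per peer.", []string{"peer"}, nil)
)

type transportCollector struct{ t *Transport }

// Compile-time assertion: a missing or drifted Describe/Collect fails the build.
var _ prometheus.Collector = &transportCollector{}

func (c *transportCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- termDesc
	ch <- queueDesc
}

func (c *transportCollector) Collect(ch chan<- prometheus.Metric) {
	// Read-only on scrape: atomic loads, no hot-path writes.
	ch <- prometheus.MustNewConstMetric(termDesc, prometheus.GaugeValue, float64(c.t.term.Load()))
	for peer, p := range c.t.peers {
		ch <- prometheus.MustNewConstMetric(queueDesc, prometheus.GaugeValue, float64(p.queueLen.Load()), peer)
	}
}

var current prometheus.Collector

// RegisterTransport is idempotent: it unregisters any prior collector first, a no-op in
// production (one transport per process) that keeps test ordering safe.
func RegisterTransport(t *Transport) {
	if current != nil {
		registry.Unregister(current)
	}
	c := &transportCollector{t: t}
	registry.MustRegister(c)
	current = c
}
```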

2. Reconcile-health metrics (operator_controller_*)

Wrapper-emitted automatically for every controller registered through observability.Wrap(reconciler, controller, defaultRequeueTimeout):

  • reconcile_steady_state_total{controller} — Counter, incremented when a reconcile returns "no work to do" — either (Result{}, nil) or (Result{RequeueAfter: defaultRequeueTimeout}, nil) matching the controller's configured periodic-requeue interval. The second shape is required because MulticlusterReconciler always returns RequeueAfter = periodicRequeue via a defer; without the dual-shape predicate the StretchCluster controller would never register as steady.
  • reconcile_last_success_timestamp_seconds{controller} — Gauge, Unix timestamp of the most recent steady-state reconcile. Prometheus computes "seconds since last success" at query time as time() - last_success_timestamp_seconds. Avoids the goroutine bookkeeping an imperative "seconds elapsed" gauge would need.

Wrap is generic over reconcile.TypedReconciler[R] so it covers both ctrl.Reconciler (single-cluster) and the multicluster reconciler (mcreconcile.Request) without duplicating the body.
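
A minimal sketch of the wrapper. The metric variable names, the omitted registration, and the exact isPeriodicRequeue signature are assumptions; the generic signature and the dual-shape steady-state predicate follow the description above.

```go
package observability

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var (
	steadyStateTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "operator_controller_reconcile_steady_state_total",
		Help: "Reconciles that returned with no work to do.",
	}, []string{"controller"})

	lastSuccessTimestamp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "operator_controller_reconcile_last_success_timestamp_seconds",
		Help: "Unix timestamp of the most recent steady-state reconcile.",
	}, []string{"controller"})

	// Exposed as a var so tests can pin the clock.
	nowUnix = func() float64 { return float64(time.Now().Unix()) }
)

// Wrap instruments any typed reconciler; one generic body covers both the
// single-cluster and multicluster request types.
func Wrap[R comparable](inner reconcile.TypedReconciler[R], controller string, defaultRequeueTimeout time.Duration) reconcile.TypedReconciler[R] {
	return reconcile.TypedFunc[R](func(ctx context.Context, req R) (reconcile.Result, error) {
		result, err := inner.Reconcile(ctx, req)
		// Steady state is either a clean zero Result or the controller's own
		// periodic-requeue interval, in both cases with a nil error.
		if err == nil && (result.IsZero() || isPeriodicRequeue(result, defaultRequeueTimeout)) {
			steadyStateTotal.WithLabelValues(controller).Inc()
			lastSuccessTimestamp.WithLabelValues(controller).Set(nowUnix())
		}
		return result, err
	})
}

func isPeriodicRequeue(result reconcile.Result, defaultRequeueTimeout time.Duration) bool {
	// Zero means no periodic requeue is configured, so nothing matches it: a stray
	// Result{Requeue: true} on a non-periodic controller must not count as steady.
	return defaultRequeueTimeout != 0 && result.RequeueAfter == defaultRequeueTimeout
}
```

Any other non-zero RequeueAfter deliberately falls outside the predicate, so a genuine backoff requeue never counts as steady state.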

Three controllers wrap: v2 Redpanda, NodePool, and StretchCluster. There are two setup paths per controller: SetupWithManager (single-cluster binary, cmd/run) and SetupWithMultiClusterManager (multicluster binary, cmd/multicluster). Both paths wrap.

3. StretchCluster member-status metrics (operator_stretchcluster_*)

Per-member gauges, bounded cardinality (stretchcluster, member):

  • member_reachable — 0/1 from the multicluster manager's reachability probe. Local cluster is always 1, recorded under its canonical name via lifecycle.CanonicalClusterName.
  • brokers / brokers_ready — desired and ready broker counts per member, summed across NodePools pointing at that member. brokers - brokers_ready > 0 = partial outage.
  • replication_health{stretchcluster} — 0/1, cluster-wide from the admin API health check reconcileDecommission already runs.
  • spec_drift{stretchcluster, member} — 0/1, whether each member's local StretchCluster.spec matches the operator's view. Set inside checkSpecConsistency.

No new API calls — passive recorder, callers pass values they already have. MulticlusterReconciler is the only consumer; instrumented at three sites where the data already lives.
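
A sketch of that passive-recorder shape, with an assumed helper name (the real helpers share the RecordStretchCluster* prefix in stretch_recorder.go): the caller passes a value it already holds and the recorder only sets a gauge.

```go
package observability

import "github.com/prometheus/client_golang/prometheus"

var stretchMemberReachable = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "operator_stretchcluster_member_reachable",
	Help: "1 if the member cluster answered the reachability probe, else 0.",
}, []string{"stretchcluster", "member"})

// RecordStretchClusterMemberReachable is an assumed helper name. It performs no API
// calls: the reconciler passes a value it already computed, and the recorder only
// flips a gauge.
func RecordStretchClusterMemberReachable(stretchCluster, member string, reachable bool) {
	v := 0.0
	if reachable {
		v = 1
	}
	stretchMemberReachable.WithLabelValues(stretchCluster, member).Set(v)
}
```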

4. PrometheusRule (gated by monitoring.rulesEnabled chart value)

operator/chart/prometheusrule.go (transpiled to _prometheusrule.go.tpl by gotohelm). The new monitoring.rulesEnabled value is a sibling of monitoring.enabled (which gates the ServiceMonitor) and is independent of it, so consumers can opt into rules without the ServiceMonitor. A sketch of the chart source follows the alert list below.

Recording rules: operator:reconcile_rate:5m, operator:reconcile_error_rate:5m, operator:reconcile_steady_state_rate:5m, operator:reconcile_p99_seconds:5m.

Alerts, all severity=warning:

  • OperatorReconcileErrors: operator:reconcile_error_rate:5m > 0.1 for 5m
  • OperatorReconcileRunaway: operator:reconcile_rate:5m > 5 for 5m (the canonical "spinning controller" signal; cross-checks steady_state_total)
  • OperatorReconcileStalled: active in the past hour but reconcile rate == 0 for 10m
  • OperatorWorkerPoolSaturated: active_workers >= max_concurrent_reconciles for 10m
  • StretchClusterMemberUnreachable: 2m
  • StretchClusterBrokerCountSkew: 10m
  • StretchClusterSpecDrift: 5m
  • StretchClusterReplicationUnhealthy: 5m
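
A gotohelm-style sketch of the chart source, assuming a simplified values type and illustrative object names and expressions; the real rule set, expressions, and for: durations are the ones listed above.

```go
package chart

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// values stands in for the chart's real values type.
type values struct {
	Monitoring struct {
		RulesEnabled bool
	}
}

// PrometheusRule sketches the Go source that gotohelm transpiles to _prometheusrule.go.tpl.
func PrometheusRule(v values) *monitoringv1.PrometheusRule {
	if !v.Monitoring.RulesEnabled { // sibling of monitoring.enabled, independent of the ServiceMonitor
		return nil
	}
	return &monitoringv1.PrometheusRule{
		ObjectMeta: metav1.ObjectMeta{Name: "operator-observability"}, // illustrative name
		Spec: monitoringv1.PrometheusRuleSpec{
			Groups: []monitoringv1.RuleGroup{{
				Name: "operator-reconcile-health",
				Rules: []monitoringv1.Rule{
					{
						Record: "operator:reconcile_rate:5m",
						Expr:   intstr.FromString(`sum by (controller) (rate(controller_runtime_reconcile_total[5m]))`),
					},
					{
						// The real alerts also carry the for: windows listed above; omitted here.
						Alert:  "OperatorReconcileRunaway",
						Expr:   intstr.FromString(`operator:reconcile_rate:5m > 5 and operator:reconcile_steady_state_rate:5m == 0`),
						Labels: map[string]string{"severity": "warning"},
					},
				},
			}},
		},
	}
}
```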

5. Comprehensive Grafana dashboard (docs/operator-grafana-dashboard.json)

Single comprehensive dashboard, 5 rows: multicluster raft, StretchCluster member status, reconcile activity, queues & workers, reconcile-health signals.

Leader-only panels (Send latency p99, Follower match-lag entries) use a PromQL and ignoring(...) set intersection to restrict each query to the current leader's perspective. This works in both dev-env (where remote_write adds a vcluster label) and prod (direct scrape with just instance).

6. Single source of truth for metric definitions

Every prometheus.New* call in the operator module now lives in operator/internal/observability/metrics.go (the v1 vectorized Cluster metrics in operator/internal/controller/vectorized/metric_controller.go are out of scope — v1 is legacy and explicitly not touched). Recorder helpers (RecordStretchCluster*) stay in stretch_recorder.go. The raft family stays in pkg/multicluster/leaderelection/metrics.go because that's a different Go module.

7. Documentation (docs/operator-metrics.md)

Canonical inventory of every metric the operator exposes. Cardinality table up front lists every label and its bounded vocabulary. Four groups: controller-runtime built-ins, reconcile-health, resource-state, multicluster raft.

8. Testing

  • Unit tests (wrapper_test.go) cover every record-path branch: Result{} on a controller with defaultRequeueTimeout=0, Result{RequeueAfter: defaultRequeueTimeout} (periodic-steady), Result{RequeueAfter: other} (a real requeue, not steady), errors, and immediate requeue, plus plain passthrough of the inner result.
  • Integration test (integration_test.go): TestIntegrationObservabilityInfiniteReconcile runs a synthetic reconciler inside a real controller-runtime Manager driven by envtest, switching between spinning (RequeueAfter 100ms) and steady state mid-test and asserting the metrics react correctly. Gated by testutil.SkipIfNotIntegration + -tags integration. A sketch of the mode-switching reconciler follows this list.
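
A sketch of that synthetic reconciler, with assumed names; the real one lives in integration_test.go.

```go
package observability_test

import (
	"context"
	"sync/atomic"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// modeSwitchingReconciler is an assumed name: an atomic flag flips it between a
// spinning mode and a steady-state mode, and the observability wrapper around it is
// what the test's assertions read.
type modeSwitchingReconciler struct {
	steady atomic.Bool
}

func (r *modeSwitchingReconciler) Reconcile(ctx context.Context, _ ctrl.Request) (ctrl.Result, error) {
	if r.steady.Load() {
		// Steady state: the wrapper bumps steady_state_total and stamps last_success.
		return ctrl.Result{}, nil
	}
	// Spinning: a tight self-requeue that must never register as steady.
	return ctrl.Result{RequeueAfter: 100 * time.Millisecond}, nil
}
```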

Design notes worth flagging for reviewers

  • leader_id is deliberately not a numeric gauge — sum(leader_id) is meaningless and state{state="leader"} == 1 already identifies the leader on each peer.
  • state modelled as one series per state value (leader|follower|candidate|pre_candidate|unknown) with 0/1, not as a label-string-as-value. sum(state) == 1 invariant; state{state="leader"} == 1 is the standard leader filter.
  • Wrapper's isPeriodicRequeue returns false when defaultRequeueTimeout == 0 so a stray Result{Requeue: true} (RequeueAfter == 0) on a non-periodic controller doesn't accidentally register as periodic-steady. Plain Result{} still counts via the result.IsZero() branch.

@hidalgopl hidalgopl marked this pull request as draft May 11, 2026 08:43
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch 2 times, most recently from fce0d26 to fc0d602 on May 12, 2026 08:46
@hidalgopl hidalgopl changed the base branch from pb/multicluster-operator-debug-bundle to main May 12, 2026 09:28
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch from fc0d602 to 32963a4 on May 13, 2026 09:02
@hidalgopl hidalgopl changed the title multicluster: emit raft metrics for cross-cluster leader election operator: multicluster end-to-end observability — raft metrics, reconcile-health, StretchCluster member status, PrometheusRule, dashboard May 13, 2026
@hidalgopl hidalgopl marked this pull request as ready for review May 13, 2026 11:49
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch 3 times, most recently from 83f91a2 to 55fee90 on May 13, 2026 12:21
hidalgopl and others added 8 commits May 13, 2026 14:38
…ter member status

Three slices of operator observability that flow into a single PrometheusRule and a single Grafana dashboard.

**Multicluster raft metrics.** The raft layer was operationally invisible — only structured logs and a `peer.dropCount` atomic surfaced via a periodic logger goroutine. New `operator_multicluster_raft_*` family registered to controller-runtime's metrics registry: `leader_changes_total`, `messages_{sent,received}_total{msg_type,peer}`, `send_errors_total{peer,error_type}` (closed six-value vocabulary: timeout/canceled/unavailable/auth/marshal/other), `messages_dropped_total{peer}`, `send_duration_seconds{peer,result}` (cross-region buckets 1ms..2.5s), `inflight_rpcs{peer}`, `peer_reachable{peer}`, `unreachable_reports_total{peer}`, `snapshots_sent_total{peer}`, `snapshot_send_errors_total{peer}`, and a leader-only `follower_match_lag_entries{peer}` (reads `node.Status().Progress`; followers keep prior value, federation expectation is "scrape from the leader"). Transport-backed gauges read existing atomics on scrape via `RegisterTransport(t)`: `term`, `state{state="leader|follower|candidate|pre_candidate|unknown"}` (one series per state with 0/1 value, no separate `is_leader` gauge), `send_queue_length{peer}`. `runDropLogger` and the `peer.dropCount` atomic are deleted — `messages_dropped_total{peer}` plus standard alerting covers it. `RegisterTransport` unregisters any prior transport collector before registering; safe no-op in prod (one transport per process), fixes test-ordering in `setupLockTest`.

**Reconcile-health metrics.** Every controller already emits the controller-runtime built-ins but there are no signals tuned for self-triggered loops, falling behind on spec, or non-determinism in spec-rendering. New `operator/internal/observability/` package adds `Wrap[R](inner reconcile.TypedReconciler[R], controller string)` middleware that emits `operator_controller_reconcile_steady_state_total{controller}` (incremented when the inner returned `(Result{}, nil)` — a controller whose `reconcile_total` rate is high but `steady_state` rate is flat is spinning) and `operator_controller_reconcile_requeue_after_seconds{controller}` (histogram of `Result.RequeueAfter`; tight cluster of sub-second values = retry loop). Generic over `reconcile.TypedReconciler[R]` so the same wrapper covers both `ctrl.Reconciler` and the multicluster reconciler. Two passive recorder helpers for per-object signals that need an object reference: `RecordObservedGeneration(controller, kind, gen, obsGen)` → `operator_controller_reconcile_observed_generation_drift` (clamps negative deltas to zero), and `RecordSpecHashChangedWithoutGeneration(controller, kind)` → `operator_controller_reconcile_spec_hash_changed_without_generation_total` (canonical non-determinism signal). Both leave it to the calling controller to decide when to record so the observability layer never duplicates API reads. Three controllers wrap at `SetupWithManager`: v2 Redpanda, v2 NodePool, and the multicluster StretchCluster reconciler. Other controllers (Console, vectorized v1, decommissioners, PVCUnbinder, NodeWatcher) keep their built-ins un-wrapped — they don't manage the resources in scope.

**StretchCluster member-status metrics.** Where `operator_controller_*` describes how the controllers behave, these describe what they're managing. All gauges, bounded label cardinality (`stretchcluster`, `member`): `operator_stretchcluster_member_reachable` (0/1 from the multicluster manager's reachability probe; local cluster always 1, recorded under its canonical name via `lifecycle.CanonicalClusterName` rather than the multicluster-runtime's empty-string sentinel); `operator_stretchcluster_brokers` / `operator_stretchcluster_brokers_ready` (desired and ready broker counts per member, summed across NodePools pointing at that member — gap on a single member = partial outage); `operator_stretchcluster_replication_health{stretchcluster}` (0/1, cluster-wide from the admin API check `reconcileDecommission` already makes, recorded right after the call returns); `operator_stretchcluster_spec_drift{stretchcluster, member}` (0/1, does each member's local StretchCluster.spec match the operator's view; set inside the existing `checkSpecConsistency` routine). No new API calls — passive recorder, callers pass values they already have. `MulticlusterReconciler` is the only consumer; instrumented at three sites where the data already lives (`checkSpecConsistency`, `reconcileDecommission`, and a new `recordBrokerCountMetrics` helper called once per reconcile after `fetchInitialState`). Unreachable members are recorded as `member_reachable=0` but `spec_drift` keeps its prior value because we genuinely don't know.

**Chart artifacts.** `operator/chart/prometheusrule.go` (transpiled to `_prometheusrule.go.tpl` by gotohelm) emits a PrometheusRule with recording rules and two alert groups — reconcile health (`OperatorReconcileErrors`, `OperatorReconcileRunaway`, `OperatorReconcileStalled`, `OperatorWorkerPoolSaturated`, `OperatorObservedGenerationDrift`, `OperatorNonDeterministicSpec`) and StretchCluster (`StretchClusterMemberUnreachable` 2m, `StretchClusterBrokerCountSkew` 10m, `StretchClusterSpecDrift` 5m, `StretchClusterReplicationUnhealthy` 5m). All severity `warning` — indicators that need eyes, not page-now incidents. New `values.monitoring.rulesEnabled` (default `false`) — sibling of `monitoring.enabled` (ServiceMonitor); independent so consumers can opt into rules without the ServiceMonitor. Chart test case `monitoring-rules-enabled` locks the output into the golden file. `docs/operator-grafana-dashboard.json` rewritten as a single comprehensive dashboard: 29 panels across 5 rows covering reconcile health, StretchCluster member status, and multicluster raft. `docs/operator-metrics.md` is the canonical inventory restructured into four groups (controller-runtime built-ins / reconcile-health / resource-state / multicluster raft); cardinality table up top lists every label and its closed vocabulary; explicit "PLANNED (not yet emitted)" subsection lists `self_triggered_total` / `time_since_last_success_seconds` so downstream dashboards don't break later.

**Design notes.** Plain `prometheus/client_golang`, not OTel — matches the existing pattern in `operator/internal/controller/vectorized/metric_controller.go`, `operator/cmd/version/version.go`, `operator/pkg/client/kgo_hooks.go`. `leader_id` is deliberately not a numeric gauge — `sum(leader_id)` is meaningless and `state{state="leader"}` already identifies the leader on each peer. `leader_changes_total` is incremented in the Ready loop on `leader != prevLeader && leader != 0`; scraping from the leader gives the cluster-wide leader-change count. `error_type` bucketing keeps cardinality bounded and each bucket maps to a different on-call story.

Tests cover `msgTypeLabel` and `normaliseRaftState` exhaustively, plus an integration test that brings up a 3-node cluster via the existing `setupLockTest` harness and asserts gauges reflect elected state and heartbeat counters accrue from natural traffic. `operator/internal/observability/wrapper_test.go` covers the steady-state counter, requeue-after histogram, observed-generation drift clamping, and the spec-hash counter. Chart golden file regenerated for `monitoring-rules-enabled`.
…ntegration test

Adds the two reconcile-health metrics that the existing
observability inventory documented as PLANNED / Reserved:

* operator_controller_reconcile_last_success_timestamp_seconds
  (gauge, wrapper-emitted). Set inside the existing
  steady-state branch via a one-line write of the current
  Unix timestamp. Bounded cardinality (controller label only).
  Prometheus computes the user-facing "seconds since last
  success" as time() - this_gauge — no goroutine bookkeeping
  or oldest-unfinished tracker required.

* operator_controller_reconcile_self_triggered_total (counter,
  opt-in via observability.RecordSelfTriggered). Increments
  when a controller has detected that its own write to an
  object will re-enqueue the same reconcile without any other
  observable effect. The wrapper deliberately does not
  increment this — that would require a redundant Get to hash
  before/after every reconcile, breaking the wrapper's
  "passive, no extra reads" design. Controllers opt in from
  their own write helpers where the pre/post-write state is
  already in hand.

The wrapper exposes nowUnix as a var so deterministic tests
can drive the gauge without clock skew.
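
A sketch of what that looks like in a test, with an assumed test name and timestamp; nowUnix here is the package-level var described above.

```go
func TestSteadyStateStampsLastSuccess(t *testing.T) {
	orig := nowUnix
	nowUnix = func() float64 { return 1_700_000_000 } // deterministic clock
	t.Cleanup(func() { nowUnix = orig })
	// ...drive the wrapped reconciler, then assert the gauge reads 1_700_000_000.
}
```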

Dashboard: the existing "Self-triggered reconciles" stat
panel drops its "reserved for future / renders N/A" language
and gains a real description pointing at RecordSelfTriggered.
A new full-width timeseries panel renders time() -
last_success_timestamp_seconds per controller as
seconds-since-last-success — climbing past natural re-queue
intervals means the controller is failing or spinning.

Tests:

* Unit tests cover the new wrapper gauge (advances on steady
  state, frozen on error and on RequeueAfter), and the
  RecordSelfTriggered helper.

* TestIntegrationObservabilityInfiniteReconcile drives the
  wrapper through a real controller-runtime Manager against
  an envtest apiserver. A test reconciler watches ConfigMaps
  and switches between spinning (RequeueAfter 100ms) and
  steady-state via an atomic. Asserts that the spinning phase
  keeps last_success_timestamp_seconds and
  steady_state_total at 0 while the requeue histogram fills,
  then validates recovery once the mode flips.
  testutil.SkipIfNotIntegration gates it behind -tags
  integration.

Docs: operator-metrics.md moves both metrics out of the
"Reserved (currently silent)" section into the live
wrapper / recorder tables. The Reserved section is now gone.

Changie entry under operator-Added-*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First opt-in caller of observability.RecordSelfTriggered. The
MulticlusterReconciler's syncStatus loop pushes the local
StretchCluster.Status to every reachable peer via
Status().Update(). When the remote's existing status is
semantically identical to what we're about to write, the Update
still bumps resourceVersion and re-enqueues the StretchCluster
reconciler on that peer — the canonical infinite-reconcile shape.

apiequality.Semantic.DeepEqual is already imported and used
elsewhere in this file for the spec-drift check; reuse it here
for the status-equality probe. Recorded after a successful
Update so transient write errors don't pollute the counter.

The metric powers the dashboard's "Self-triggered reconciles"
stat panel — sustained non-zero rate is an indicator that the
syncStatus path should diff before writing rather than writing
unconditionally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stogram

The wrapper's previous steady-state predicate only matched
(Result{}, nil) — a clean "no work" return. But the
MulticlusterReconciler always returns RequeueAfter =
periodicRequeue via a defer (the canonical "wake me up
periodically" pattern), so it never entered the steady-state
branch. Result: last_success_timestamp_seconds and
steady_state_total stayed permanently empty for that
controller despite it being healthy.

Wrap now takes a third argument: defaultRequeueTimeout. The
record() branch is now:

  err == nil && (result.IsZero() || isPeriodicRequeue(result))

Both shapes count as steady state. isPeriodicRequeue returns
true when result.RequeueAfter exactly equals
defaultRequeueTimeout (and false when defaultRequeueTimeout
== 0, so a stray Result{Requeue: true} on a non-periodic
controller doesn't accidentally register).

The requeue-after histogram now skips the periodic value —
otherwise the periodic-wake samples would dominate every
bucket and bury the tight-retry-loop signal the histogram
exists to surface.

Three call sites updated to pass the right periodic value:

  * redpanda_controller.go:157   → periodicRequeue
  * nodepool_controller.go:155   → periodicRequeue
  * multicluster_controller.go   → defaultReconcileTimeout

Tests cover all four paths:

  * Result{} on a controller with defaultRequeueTimeout=0 → steady
  * Result{RequeueAfter: defaultRequeueTimeout} → steady
  * Result{RequeueAfter: other} → not steady, observed in histogram
  * Result{RequeueAfter: defaultRequeueTimeout} → NOT observed
    in histogram

Docs and recorder.go godoc updated to reflect the dual-shape
steady-state definition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A critical second pass on the operator_controller_* family
revealed three metrics that were defined and exported but had
zero call sites: passive opt-in helpers waiting for a
controller to wire them, which never happened. Same shape as
the self_triggered_total metric that was removed in an earlier
pass.

Removed metrics (with their helpers, tests, alerts, and
dashboard panels):

  * operator_controller_reconcile_self_triggered_total
    (was wired to a syncStatus deep-equal probe in
    multicluster_controller.go — that wiring is reverted here
    too; the canonical "spinning loop" pattern is already
    detected by rate(reconcile_total) > rate(steady_state_total)
    and the OperatorReconcileRunaway alert).

  * operator_controller_reconcile_observed_generation_drift
    (gauge + RecordObservedGeneration helper; alert
    OperatorObservedGenerationDrift; dashboard panel 21).
    Generation drift is detectable on a per-resource basis
    from status.observedGeneration directly if a future use
    case requires it.

  * operator_controller_reconcile_spec_hash_changed_without_generation_total
    (counter + RecordSpecHashChangedWithoutGeneration helper;
    alert OperatorNonDeterministicSpec; dashboard panel 22).
    A passive opt-in helper with zero call sites is dead code.

Net surface of the operator_controller_* family is now three
wrapper-emitted metrics — no opt-in helpers, no dead code:

  * reconcile_steady_state_total       (counter)
  * reconcile_requeue_after_seconds    (histogram)
  * reconcile_last_success_timestamp_seconds (gauge)

While here:

  * The Wrap call in multicluster_controller.go was passing
    defaultReconcileTimeout (2m, the per-reconcile context
    deadline) instead of periodicRequeue (3m, what the
    reconciler actually returns). That meant the wrapper's
    isPeriodicRequeue predicate never matched and
    last_success_timestamp_seconds stayed empty for the
    StretchCluster controller. Fixed.

  * Dashboard panels 20 and 24 expanded to full width to fill
    the gaps left by deleted panels 21 / 22. Version bumped
    to 6.

  * Docs (operator-metrics.md), changelog entries, and the
    chart's prometheusrule.go all reflect the reduced surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Metric definitions were spread across three files in the operator
module:

  * operator/internal/observability/recorder.go    (3 operator_controller_*)
  * operator/internal/observability/stretch_recorder.go (5 operator_stretchcluster_*)
  * operator/internal/controller/redpanda/metric_controller.go (4 v2 redpanda_*)

To answer "what metrics does this operator expose?" you had to
grep three files. Consolidating: every prometheus.New* call now
lives in operator/internal/observability/metrics.go, grouped
into three sections (reconcile-health, StretchCluster, Redpanda
v2 CR resource-state) with a single init() that registers all
12 metrics. The redpanda metric_controller.go reconciler imports
the exported vars from the observability package instead of
defining its own.

The v2 metric vars are now exported (Redpandas,
RedpandaDesiredNodes, RedpandaReadyNodes,
RedpandaMisconfiguredClusters) instead of package-private.
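
A sketch of the consolidated file's single registration point; the Redpandas var is named in this commit, but the metric name and help text shown are illustrative.

```go
package observability

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Exported so the redpanda metric_controller.go reconciler can set it directly.
var Redpandas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "redpanda_resources", // illustrative name only
	Help: "Redpanda v2 custom resources known to the operator.",
}, nil)

func init() {
	// One registration point for every metric defined in this file, against
	// controller-runtime's registry so everything shares the /metrics endpoint.
	metrics.Registry.MustRegister(
		Redpandas,
		// ...the remaining reconcile-health, StretchCluster, and Redpanda v2 metrics
	)
}
```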

recorder.go is reduced to the package docstring (kept as the
canonical place for the package-level godoc).
stretch_recorder.go is reduced to the four RecordStretchCluster*
helper functions.

Out of scope:

  * v1 (vectorized.redpanda.com Cluster) metrics — that
    controller is legacy and explicitly excluded from
    unrelated changes. Its 4 metrics remain defined next to
    its reconciler.
  * Multicluster raft metrics in pkg/multicluster/leaderelection/
    metrics.go — different Go module; cross-module consolidation
    would either expose operator/internal/observability/ to all
    pkg/ consumers or pull controller-runtime deps into the pkg
    module.

No emitted metric names change. No alerts or dashboards affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reconcile_requeue_after_seconds histogram was a triage-aid
metric distinguishing self-requeue churn from external-event-
driven churn, but in practice the primary spinning detector
(rate(reconcile_total) > 5 while rate(steady_state_total) == 0,
fired by OperatorReconcileRunaway) is sufficient and the
histogram's added diagnostic value rarely justified the
maintenance cost — 11 bucket counters per controller plus a
wrapper-side write per non-zero non-periodic RequeueAfter
return.

The wrapper retains its isPeriodicRequeue predicate because the
steady-state branch still needs it to recognise periodic-wake
returns as steady. The dashboard "Reconcile-health signals"
section now contains a single full-width panel
(time() - last_success_timestamp_seconds).

Net surface of operator_controller_* after this drop:

  * reconcile_steady_state_total       (counter)
  * reconcile_last_success_timestamp_seconds (gauge)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, changie

Five small follow-up changes from a final review pass over the
observability work:

1. operator/internal/controller/redpanda/nodepool_controller.go —
   wrap NodePool with observability.Wrap in SetupWithMultiClusterManager.
   The multicluster binary (cmd/multicluster) calls this path; only the
   single-cluster binary's SetupWithManager was wrapping. Result:
   NodePool emitted controller-runtime built-in metrics in the
   multicluster binary but nothing from the operator_controller_*
   family — its dashboard panels were permanently empty. Found via
   dogfooding the dashboard against the dev env.

2. docs/operator-grafana-dashboard.json — fix two leader-only panel
   queries ("Send latency p99 (per peer, leader's view)" and "Follower
   match-lag entries (leader's view)"). The old form used
   `sum by (le, peer) (...) * on(instance) group_left() (leader==1)`,
   which drops `instance` in the aggregation and then tries to join on
   it. Replaced with `and ignoring(le, peer, result, state)
   (leader==1)` — set-intersection that joins on every shared identity
   label, works in both dev (with the dev-env `vcluster` label) and
   prod (direct scrape with just `instance`).

3. pkg/multicluster/leaderelection/metrics.go — compile-time assertion
   `var _ prometheus.Collector = &transportCollector{}` so a missing
   or signature-drifted Describe / Collect method fails the build
   instead of failing at runtime registration.

4. docs/operator-metrics.md — clean up stale references left over
   from the metric audit (mentions of generation drift /
   non-determinism in the Group 2 intro, the orphaned `kind` label in
   the cardinality table, the opt-in Record* helpers that no longer
   exist, and the dashboard cross-link framing).

5. .changes/unreleased/ — replace three incremental Added entries
   with one consolidated entry covering the whole observability story.
   The previous incremental entries described slices in commit-time
   order and the raft family (the original subject of this PR) had no
   dedicated entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl force-pushed the pb/multicluster-raft-metrics branch from 55fee90 to f0c05ba on May 13, 2026 12:39