fix(engine): centralize telemetry timer management in runtime manager #2804
Draft
cijothomas wants to merge 7 commits into open-telemetry:main from
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #2804      +/-   ##
==========================================
- Coverage   86.14%   86.13%   -0.01%
==========================================
  Files         707      707
  Lines      267469   267373      -96
==========================================
- Hits       230419   230315     -104
- Misses      36526    36534       +8
  Partials      524      524
```
Force-pushed from b783a18 to 767831c
Fixes open-telemetry#1305

Previously, every node (receiver, processor, exporter) independently called `effect_handler.start_periodic_telemetry(Duration::from_secs(1))` with a hardcoded 1-second interval. This was:

- Not configurable by operators
- Not enforceable (each node picked its own interval)
- A significant contributor to idle CPU (~50 millicores on 4 cores)

The runtime manager now registers telemetry timers for all nodes centrally during pipeline startup, using the configured `engine.telemetry.reporting_interval`. This:

- Removes `start_periodic_telemetry` calls from all 15 node files
- Eliminates per-node cancel-handle management on shutdown
- Enforces a single, consistent collection cadence by construction
- Uses the existing configurable `reporting_interval` (default 1 s)

The idle test configuration is updated to use `reporting_interval: 5s` and a matching 5 s Prometheus scrape interval, reducing idle CPU from ~0.9% to ~0.1% on 4 cores. Also fixes the idle-state-template Prometheus endpoint URLs to use the correct `/api/v1` prefix.
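The centralization described above can be sketched as follows. This is a hypothetical, heavily simplified model, not the engine's actual types: `TelemetryTimers`, its `timer_states` map, and the node names are stand-ins that mirror the identifiers mentioned in this PR.

```rust
use std::collections::HashMap;
use std::time::Duration;

// Hypothetical stand-in for the runtime manager's telemetry timer
// registry; the real engine type has scheduling behavior omitted here.
struct TelemetryTimers {
    timer_states: HashMap<String, Duration>,
}

impl TelemetryTimers {
    fn new() -> Self {
        Self { timer_states: HashMap::new() }
    }

    // Register one periodic telemetry timer for a node at the given interval.
    fn start(&mut self, node_id: &str, interval: Duration) {
        self.timer_states.insert(node_id.to_string(), interval);
    }
}

fn main() {
    // The single configured cadence (engine.telemetry.reporting_interval)
    // replaces the per-node hardcoded Duration::from_secs(1).
    let reporting_interval = Duration::from_secs(5);
    let node_ids = ["otlp_receiver", "batch_processor", "otlp_exporter"];

    // Centralized registration at pipeline startup: every node gets the
    // same interval by construction, with no per-node timer code.
    let mut telemetry_timers = TelemetryTimers::new();
    for node_id in node_ids {
        telemetry_timers.start(node_id, reporting_interval);
    }

    assert_eq!(telemetry_timers.timer_states.len(), 3);
    assert!(telemetry_timers
        .timer_states
        .values()
        .all(|i| *i == reporting_interval));
    println!("registered {} timers", telemetry_timers.timer_states.len());
}
```

Because registration happens in one loop over all nodes, a node cannot opt into a different cadence, which is exactly the "enforceable by construction" property the commit message claims.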
Force-pushed from 767831c to dbdd6d7
Pull request overview
This PR centralizes periodic node telemetry scheduling inside the engine runtime-control manager so pipelines use the configured engine-wide reporting interval instead of having each node start and cancel its own telemetry timer. It also updates the idle perf test to use a slower 5s telemetry/scrape cadence that better matches the new centralized behavior.
Changes:
- Pre-register telemetry timers for all nodes in the runtime-control manager and sync control-plane timer metrics immediately.
- Remove per-node `start_periodic_telemetry()` / cancel-handle management from receivers, processors, exporters, and validation code.
- Add a dedicated idle perf-test engine config with `engine.telemetry.reporting_interval: 5s` and align Prometheus scraping to 5 s.
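The dotted setting `engine.telemetry.reporting_interval` presumably maps to nested YAML. A minimal sketch of just that stanza; the rest of `otlp-attr-otlp-idle.yaml` is not shown in this PR summary, so everything around this fragment is assumed:

```yaml
engine:
  telemetry:
    # Centralized per-node telemetry cadence; previously each node
    # hardcoded a 1-second timer regardless of configuration.
    reporting_interval: 5s
```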
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tools/pipeline_perf_test/test_suites/integration/templates/configs/engine/continuous/otlp-attr-otlp-idle.yaml | New idle perf-test engine config with 5 s telemetry interval. |
| tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2 | Switches idle test to the new config and sets 5 s scrape interval. |
| rust/otap-dataflow/crates/validation/src/validation_exporter.rs | Removes exporter-local telemetry timer startup. |
| rust/otap-dataflow/crates/engine/src/processor.rs | Removes processor wrapper telemetry timer lifecycle. |
| rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs | Centralizes telemetry timer registration in the runtime manager and updates tests. |
| rust/otap-dataflow/crates/engine/src/control.rs | Adds helper to enumerate registered node IDs. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs | Removes topic receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs | Removes syslog receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/otlp_receiver/mod.rs | Removes OTLP receiver telemetry timer/cancel plumbing. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/otap_receiver/mod.rs | Removes OTAP receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/internal_telemetry_receiver/mod.rs | Removes internal telemetry receiver timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs | Removes fake generator telemetry timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs | Removes topic exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/perf_exporter/mod.rs | Removes perf exporter custom telemetry timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/parquet_exporter/mod.rs | Removes parquet exporter telemetry timer management and related test. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_http_exporter/mod.rs | Removes OTLP/HTTP exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_grpc_exporter/mod.rs | Removes OTLP/gRPC exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs | Removes OTAP exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/contrib-nodes/src/exporters/geneva_exporter/mod.rs | Removes Geneva exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/contrib-nodes/src/exporters/azure_monitor_exporter/exporter.rs | Removes Azure Monitor exporter telemetry timer/cancel handling. |
Comment on lines +324 to +342:
```rust
// Register telemetry timers for all nodes centrally, using the
// configured reporting interval. This replaces per-node
// start_periodic_telemetry calls and ensures a single, consistent
// collection cadence across all nodes.
for node_id in result.control_senders.node_ids() {
    result
        .telemetry_timers
        .start(node_id, result.control_plane_metrics_flush_interval);
}

// Sync the metrics shadow with the pre-registered timers so the
// `telemetry_timers.active` gauge reflects reality before the first
// scheduler tick instead of reporting 0 for one full reporting interval.
result.runtime_control_metrics.set_timer_counts(
    result.tick_timers.timer_states.len(),
    result.telemetry_timers.timer_states.len(),
);

result
```
…try-timers

# Conflicts:
#	rust/otap-dataflow/crates/engine/src/processor.rs
Acknowledge the start-up race flagged in PR review (telemetry can queue ahead of Shutdown in a slow-starting node's bounded control channel) and defer the proper fix to a follow-up.
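The race acknowledged above can be illustrated with a bounded channel. This is a toy model, not the engine's control plane: the `Control` enum, the capacity, and the ordering setup are illustrative assumptions, but they show why FIFO delivery on a bounded per-node channel means an eagerly queued telemetry message is handled before a later `Shutdown`.

```rust
use std::sync::mpsc::sync_channel;

// Illustrative stand-in for the engine's node control messages.
#[derive(Debug, PartialEq)]
enum Control {
    CollectTelemetry,
    Shutdown,
}

fn main() {
    // Bounded control channel per node (capacity is illustrative).
    let (tx, rx) = sync_channel::<Control>(8);

    // The central manager registers telemetry timers eagerly at startup,
    // so a CollectTelemetry can be queued before the node begins work…
    tx.send(Control::CollectTelemetry).unwrap();
    // …and a Shutdown issued shortly afterwards lands behind it.
    tx.send(Control::Shutdown).unwrap();

    // A slow-starting node drains the channel in FIFO order, so it must
    // process the telemetry request before it ever sees the Shutdown.
    assert_eq!(rx.recv().unwrap(), Control::CollectTelemetry);
    assert_eq!(rx.recv().unwrap(), Control::Shutdown);
    println!("telemetry delivered ahead of shutdown");
}
```

A node-ready signal, as deferred to the follow-up, would delay the first `CollectTelemetry` until the node is draining its channel, removing this ordering hazard.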
…ry_interval

Address PR review feedback: `control_plane_metrics_flush_interval` was serving two unrelated purposes after centralization, covering both the manager's internal metrics-flush cadence and the per-node CollectTelemetry cadence. To accommodate this, the test constant had been bumped from 10 ms to 3600 s, which silently disabled coverage of the metrics-flush path.

Split into two constructor parameters (`control_plane_metrics_flush_interval` unchanged, plus a new `node_telemetry_interval`). Production passes the same `engine.telemetry.reporting_interval` to both, preserving behavior. Tests can now keep `flush = 10 ms` (so the flush path is exercised) while pinning `node_telemetry = 3600 s` (so the pre-registered telemetry timers stay dormant during short tests).

All 354 engine lib tests pass; `clippy --workspace --all-targets` is clean.
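The parameter split in that commit can be sketched as below. The `RuntimeControlManager` struct here is a hypothetical reduction; the real constructor takes many more parameters, and only the two intervals from the commit message are modeled.

```rust
use std::time::Duration;

// Hypothetical, reduced view of the manager after the split: the two
// cadences are now independent constructor parameters.
struct RuntimeControlManager {
    control_plane_metrics_flush_interval: Duration,
    node_telemetry_interval: Duration,
}

impl RuntimeControlManager {
    fn new(
        control_plane_metrics_flush_interval: Duration,
        node_telemetry_interval: Duration,
    ) -> Self {
        Self {
            control_plane_metrics_flush_interval,
            node_telemetry_interval,
        }
    }
}

fn main() {
    // Production wires the same configured reporting interval into both,
    // preserving pre-split behavior.
    let reporting_interval = Duration::from_secs(1);
    let prod = RuntimeControlManager::new(reporting_interval, reporting_interval);
    assert_eq!(
        prod.control_plane_metrics_flush_interval,
        prod.node_telemetry_interval
    );

    // Tests keep a short flush interval so that path is exercised, while
    // pinning node telemetry far out so pre-registered timers stay dormant.
    let test_mgr = RuntimeControlManager::new(
        Duration::from_millis(10),
        Duration::from_secs(3600),
    );
    assert!(test_mgr.control_plane_metrics_flush_interval < test_mgr.node_telemetry_interval);
    println!("intervals decoupled");
}
```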
Fixes #1305

Centralizes telemetry timer registration in the runtime control manager so all nodes use the configured `engine.telemetry.reporting_interval` instead of each calling `start_periodic_telemetry(Duration::from_secs(1))` directly. Removes ~250 lines of per-node boilerplate across 15 nodes; shutdown cancellation is handled centrally.

Also bumps the idle perf test to `reporting_interval: 5s` (with a matching 5 s Prometheus scrape) for a more realistic deployment baseline.

Notes:
- `perf_exporter.config.frequency` is now silently ignored; deprecation/cleanup is a follow-up.
- The `start_periodic_telemetry` API is still public; TBD whether to remove it.
- `pipeline_ctrl.rs` notes an eager-timer-registration race flagged in review (telemetry can queue ahead of `Shutdown` in a slow-starting node's bounded control channel); to be addressed via a node-ready signal in a follow-up.