fix(engine): centralize telemetry timer management in runtime manager #2804
Draft
cijothomas wants to merge 7 commits into open-telemetry:main from
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #2804      +/-   ##
==========================================
- Coverage   86.14%   86.13%   -0.01%
==========================================
  Files         707      707
  Lines      267469   267373      -96
==========================================
- Hits       230419   230315     -104
- Misses      36526    36534       +8
  Partials      524      524
```
Force-pushed from b783a18 to 767831c
Fixes open-telemetry#1305

Previously, every node (receiver, processor, exporter) independently called `effect_handler.start_periodic_telemetry(Duration::from_secs(1))` with a hardcoded 1-second interval. This was:

- Not configurable by operators
- Not enforceable (each node picked its own interval)
- A significant contributor to idle CPU (~50 millicores on 4 cores)

The runtime manager now registers telemetry timers for all nodes centrally during pipeline startup, using the configured `engine.telemetry.reporting_interval`. This:

- Removes `start_periodic_telemetry` calls from all 15 node files
- Eliminates per-node cancel-handle management on shutdown
- Enforces a single, consistent collection cadence by construction
- Uses the existing configurable `reporting_interval` (default 1 s)

The idle test configuration is updated to use `reporting_interval: 5s` and a matching 5 s Prometheus scrape interval, reducing idle CPU from ~0.9% to ~0.1% on 4 cores. Also fixes the idle-state-template Prometheus endpoint URLs to use the correct `/api/v1` prefix.
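The centralization described above can be sketched as follows. This is a hypothetical, heavily simplified model, not the engine's actual types: `TelemetryTimers`, its `timer_states` map, and the node names are stand-ins that mirror the identifiers mentioned in this PR.

```rust
use std::collections::HashMap;
use std::time::Duration;

// Hypothetical stand-in for the runtime manager's telemetry timer
// registry; the real engine type has scheduling behavior omitted here.
struct TelemetryTimers {
    timer_states: HashMap<String, Duration>,
}

impl TelemetryTimers {
    fn new() -> Self {
        Self { timer_states: HashMap::new() }
    }

    // Register one periodic telemetry timer for a node at the given interval.
    fn start(&mut self, node_id: &str, interval: Duration) {
        self.timer_states.insert(node_id.to_string(), interval);
    }
}

fn main() {
    // The single configured cadence (engine.telemetry.reporting_interval)
    // replaces the per-node hardcoded Duration::from_secs(1).
    let reporting_interval = Duration::from_secs(5);
    let node_ids = ["otlp_receiver", "batch_processor", "otlp_exporter"];

    // Centralized registration at pipeline startup: every node gets the
    // same interval by construction, with no per-node timer code.
    let mut telemetry_timers = TelemetryTimers::new();
    for node_id in node_ids {
        telemetry_timers.start(node_id, reporting_interval);
    }

    assert_eq!(telemetry_timers.timer_states.len(), 3);
    assert!(telemetry_timers
        .timer_states
        .values()
        .all(|i| *i == reporting_interval));
    println!("registered {} timers", telemetry_timers.timer_states.len());
}
```

Because registration happens in one loop over all nodes, a node cannot opt into a different cadence, which is exactly the "enforceable by construction" property the commit message claims.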
Force-pushed from 767831c to dbdd6d7
Pull request overview
This PR centralizes periodic node telemetry scheduling inside the engine runtime-control manager so pipelines use the configured engine-wide reporting interval instead of having each node start and cancel its own telemetry timer. It also updates the idle perf test to use a slower 5s telemetry/scrape cadence that better matches the new centralized behavior.
Changes:
- Pre-register telemetry timers for all nodes in the runtime-control manager and sync control-plane timer metrics immediately.
- Remove per-node `start_periodic_telemetry()` / cancel-handle management from receivers, processors, exporters, and validation code.
- Add a dedicated idle perf-test engine config with `engine.telemetry.reporting_interval: 5s` and align Prometheus scraping to 5 s.
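The dotted setting `engine.telemetry.reporting_interval` presumably maps to nested YAML. A minimal sketch of just that stanza; the rest of `otlp-attr-otlp-idle.yaml` is not shown in this PR summary, so everything around this fragment is assumed:

```yaml
engine:
  telemetry:
    # Centralized per-node telemetry cadence; previously each node
    # hardcoded a 1-second timer regardless of configuration.
    reporting_interval: 5s
```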
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tools/pipeline_perf_test/test_suites/integration/templates/configs/engine/continuous/otlp-attr-otlp-idle.yaml | New idle perf-test engine config with 5 s telemetry interval. |
| tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2 | Switches idle test to the new config and sets 5 s scrape interval. |
| rust/otap-dataflow/crates/validation/src/validation_exporter.rs | Removes exporter-local telemetry timer startup. |
| rust/otap-dataflow/crates/engine/src/processor.rs | Removes processor wrapper telemetry timer lifecycle. |
| rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs | Centralizes telemetry timer registration in the runtime manager and updates tests. |
| rust/otap-dataflow/crates/engine/src/control.rs | Adds helper to enumerate registered node IDs. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs | Removes topic receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs | Removes syslog receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/otlp_receiver/mod.rs | Removes OTLP receiver telemetry timer/cancel plumbing. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/otap_receiver/mod.rs | Removes OTAP receiver telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/internal_telemetry_receiver/mod.rs | Removes internal telemetry receiver timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs | Removes fake generator telemetry timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs | Removes topic exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/perf_exporter/mod.rs | Removes perf exporter custom telemetry timer startup. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/parquet_exporter/mod.rs | Removes parquet exporter telemetry timer management and related test. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_http_exporter/mod.rs | Removes OTLP/HTTP exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_grpc_exporter/mod.rs | Removes OTLP/gRPC exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs | Removes OTAP exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/contrib-nodes/src/exporters/geneva_exporter/mod.rs | Removes Geneva exporter telemetry timer/cancel handling. |
| rust/otap-dataflow/crates/contrib-nodes/src/exporters/azure_monitor_exporter/exporter.rs | Removes Azure Monitor exporter telemetry timer/cancel handling. |
Comment on lines +324 to +342:
```rust
// Register telemetry timers for all nodes centrally, using the
// configured reporting interval. This replaces per-node
// start_periodic_telemetry calls and ensures a single, consistent
// collection cadence across all nodes.
for node_id in result.control_senders.node_ids() {
    result
        .telemetry_timers
        .start(node_id, result.control_plane_metrics_flush_interval);
}

// Sync the metrics shadow with the pre-registered timers so the
// `telemetry_timers.active` gauge reflects reality before the first
// scheduler tick instead of reporting 0 for one full reporting interval.
result.runtime_control_metrics.set_timer_counts(
    result.tick_timers.timer_states.len(),
    result.telemetry_timers.timer_states.len(),
);

result
```
…try-timers

# Conflicts:
#	rust/otap-dataflow/crates/engine/src/processor.rs
Acknowledge the start-up race flagged in PR review (telemetry can queue ahead of Shutdown in a slow-starting node's bounded control channel) and defer the proper fix to a follow-up.
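The race acknowledged above can be illustrated with a bounded channel. This is a toy model, not the engine's control plane: the `Control` enum, the capacity, and the ordering setup are illustrative assumptions, but they show why FIFO delivery on a bounded per-node channel means an eagerly queued telemetry message is handled before a later `Shutdown`.

```rust
use std::sync::mpsc::sync_channel;

// Illustrative stand-in for the engine's node control messages.
#[derive(Debug, PartialEq)]
enum Control {
    CollectTelemetry,
    Shutdown,
}

fn main() {
    // Bounded control channel per node (capacity is illustrative).
    let (tx, rx) = sync_channel::<Control>(8);

    // The central manager registers telemetry timers eagerly at startup,
    // so a CollectTelemetry can be queued before the node begins work…
    tx.send(Control::CollectTelemetry).unwrap();
    // …and a Shutdown issued shortly afterwards lands behind it.
    tx.send(Control::Shutdown).unwrap();

    // A slow-starting node drains the channel in FIFO order, so it must
    // process the telemetry request before it ever sees the Shutdown.
    assert_eq!(rx.recv().unwrap(), Control::CollectTelemetry);
    assert_eq!(rx.recv().unwrap(), Control::Shutdown);
    println!("telemetry delivered ahead of shutdown");
}
```

A node-ready signal, as deferred to the follow-up, would delay the first `CollectTelemetry` until the node is draining its channel, removing this ordering hazard.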
…ry_interval

Address PR review feedback: `control_plane_metrics_flush_interval` was serving two unrelated purposes after centralization, covering both the manager's internal metrics-flush cadence and the per-node CollectTelemetry cadence. To accommodate this, the test constant had been bumped from 10 ms to 3600 s, which silently disabled coverage of the metrics-flush path.

Split into two constructor parameters (`control_plane_metrics_flush_interval` unchanged, plus a new `node_telemetry_interval`). Production passes the same `engine.telemetry.reporting_interval` to both, preserving behavior. Tests can now keep `flush = 10 ms` (so the flush path is exercised) while pinning `node_telemetry = 3600 s` (so the pre-registered telemetry timers stay dormant during short tests).

All 354 engine lib tests pass; `clippy --workspace --all-targets` is clean.
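The parameter split in that commit can be sketched as below. The `RuntimeControlManager` struct here is a hypothetical reduction; the real constructor takes many more parameters, and only the two intervals from the commit message are modeled.

```rust
use std::time::Duration;

// Hypothetical, reduced view of the manager after the split: the two
// cadences are now independent constructor parameters.
struct RuntimeControlManager {
    control_plane_metrics_flush_interval: Duration,
    node_telemetry_interval: Duration,
}

impl RuntimeControlManager {
    fn new(
        control_plane_metrics_flush_interval: Duration,
        node_telemetry_interval: Duration,
    ) -> Self {
        Self {
            control_plane_metrics_flush_interval,
            node_telemetry_interval,
        }
    }
}

fn main() {
    // Production wires the same configured reporting interval into both,
    // preserving pre-split behavior.
    let reporting_interval = Duration::from_secs(1);
    let prod = RuntimeControlManager::new(reporting_interval, reporting_interval);
    assert_eq!(
        prod.control_plane_metrics_flush_interval,
        prod.node_telemetry_interval
    );

    // Tests keep a short flush interval so that path is exercised, while
    // pinning node telemetry far out so pre-registered timers stay dormant.
    let test_mgr = RuntimeControlManager::new(
        Duration::from_millis(10),
        Duration::from_secs(3600),
    );
    assert!(test_mgr.control_plane_metrics_flush_interval < test_mgr.node_telemetry_interval);
    println!("intervals decoupled");
}
```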
Fixes #1305

Centralizes telemetry timer registration in the runtime control manager so all nodes use the configured `engine.telemetry.reporting_interval` instead of each calling `start_periodic_telemetry(Duration::from_secs(1))` directly. Removes ~250 lines of per-node boilerplate across 15 nodes; shutdown cancellation is handled centrally.

Also bumps the idle perf test to `reporting_interval: 5s` (with a matching 5 s Prometheus scrape) for a more realistic deployment baseline.

Notes:
- `perf_exporter.config.frequency` is now silently ignored; deprecation/cleanup is a follow-up.
- The `start_periodic_telemetry` API is still public; TBD whether to remove it.
- `pipeline_ctrl.rs` notes an eager-timer-registration race flagged in review (telemetry can queue ahead of `Shutdown` in a slow-starting node's bounded control channel); to be addressed via a node-ready signal in a follow-up.