feat: GPU Telemetry Realtime Dashboard Display #370
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces comprehensive GPU telemetry support to the aiperf system by adding new configuration fields for telemetry modes and URLs, defining real-time telemetry message types and commands, creating UI dashboard components for GPU metrics visualization, establishing telemetry hooks and mixins, and updating telemetry managers and record processors to track and publish real-time metrics alongside traditional telemetry collection.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant RecordsManager
    participant TelemetryManager
    participant Processor
    participant Dashboard
    participant Hooks
    User->>RecordsManager: START_REALTIME_TELEMETRY
    RecordsManager->>RecordsManager: Switch mode to REALTIME_DASHBOARD
    RecordsManager->>TelemetryManager: Collect telemetry
    TelemetryManager->>Processor: Process metrics
    Processor->>Processor: Format MetricResult entries
    Processor->>RecordsManager: Return metrics
    RecordsManager->>RecordsManager: _generate_realtime_telemetry_metrics()
    RecordsManager->>Hooks: Trigger ON_REALTIME_TELEMETRY_METRICS
    Hooks->>Dashboard: on_realtime_telemetry_metrics(metrics)
    Dashboard->>Dashboard: Update GPU tables per GPU
    Dashboard->>User: Display real-time metrics
```
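The hook fan-out at the heart of this flow can be sketched in a few lines. The `HookRegistry` class and the `MetricResult` fields below are simplified stand-ins for illustration only, not aiperf's actual API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class MetricResult:
    """Simplified stand-in for aiperf's MetricResult model."""
    tag: str
    avg: float

class HookRegistry:
    """Minimal fan-out for an ON_REALTIME_TELEMETRY_METRICS-style hook."""

    def __init__(self) -> None:
        self._callbacks = []

    def attach(self, callback) -> None:
        self._callbacks.append(callback)

    async def trigger(self, metrics: list[MetricResult]) -> None:
        # Fan the same metrics list out to every registered handler.
        await asyncio.gather(*(cb(metrics) for cb in self._callbacks))

received: list[list[MetricResult]] = []

async def on_realtime_telemetry_metrics(metrics: list[MetricResult]) -> None:
    # In the real dashboard this would update per-GPU tables.
    received.append(metrics)

async def main() -> None:
    hooks = HookRegistry()
    hooks.attach(on_realtime_telemetry_metrics)
    await hooks.trigger([MetricResult(tag="gpu_utilization", avg=87.5)])

asyncio.run(main())
print(received[0][0].tag)  # -> gpu_utilization
```

The diagram's RecordsManager plays the role of `trigger`, and each UI component registers a handler via `attach`.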
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple architectural layers with significant logic additions to telemetry processing and UI integration, balanced by cohesive grouping around the telemetry feature. Key complexity drivers include: (1) multi-file coordination across enums, messages, hooks, and mixins; (2) state management in TelemetryManager with new collector_id_to_url mapping; (3) comprehensive refactor of records_manager with new reporting tasks and command handling; (4) three new UI component classes with nested helper methods; (5) integration points requiring cross-subsystem understanding. However, changes are relatively homogeneous in intent (telemetry feature additions) despite broad file distribution.
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Codecov Report
❌ Patch coverage is
📢 Thoughts on this report? Let us know!
Actionable comments posted: 5
🧹 Nitpick comments (3)
docs/tutorials/gpu-telemetry.md (1)
144-144: Consider documenting the purpose of HTML comment markers. The section delimiters (e.g., `<!-- setup-vllm-gpu-telemetry-default-openai-endpoint-server -->`, `<!-- /setup-vllm-gpu-telemetry-default-openai-endpoint-server -->`) appear intentional, likely for programmatic documentation generation or test orchestration. Consider adding a brief comment block at the top of the file explaining their purpose and any downstream tooling that depends on them, so future maintainers understand why they exist. Also applies to: 208-208, 251-251, 261-261, 265-265, 286-286
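To illustrate the kind of downstream tooling such paired markers enable, here is a hypothetical extractor. The helper name and the `setup-server` marker below are invented for illustration and do not reflect the repository's actual test runner:

```python
import re

def extract_marked_section(markdown: str, marker: str) -> str:
    """Return the text between <!-- marker --> and <!-- /marker -->."""
    pattern = re.compile(
        rf"<!--\s*{re.escape(marker)}\s*-->(.*?)<!--\s*/{re.escape(marker)}\s*-->",
        re.DOTALL,
    )
    match = pattern.search(markdown)
    if match is None:
        raise ValueError(f"marker {marker!r} not found")
    return match.group(1).strip()

doc = """
<!-- setup-server -->
docker run --gpus all my-image
<!-- /setup-server -->
"""
print(extract_marked_section(doc, "setup-server"))  # -> docker run --gpus all my-image
```

A docs-as-tests runner could pull each fenced command out this way and execute it in CI, which is why the markers are worth documenting.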
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (2)
41-43: Annotate mutable class attributes as `ClassVar` (or make them tuples). `STATS_FIELDS` and `COLUMNS` are shared, mutable class attributes flagged by Ruff (RUF012). Mark them as `ClassVar` and switch to immutable tuples to avoid accidental per-instance mutation. Apply this diff:

```diff
-from rich.text import Text
+from typing import ClassVar
+
+from rich.text import Text
@@
-    STATS_FIELDS = ["avg", "min", "max", "p99", "p90", "p50", "std"]
-    COLUMNS = ["Metric", *STATS_FIELDS]
+    STATS_FIELDS: ClassVar[tuple[str, ...]] = ("avg", "min", "max", "p99", "p90", "p50", "std")
+    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric", *STATS_FIELDS)
```
112-120: Narrow the exception handling when updating cells. Catching bare `Exception` hides legitimate errors. `DataTable.update_cell` mainly raises `RowDoesNotExist` or `ColumnDoesNotExist`; handle those explicitly so real bugs surface. Apply this diff:

```diff
-from textual.widgets.data_table import ColumnKey, RowDoesNotExist, RowKey
+from textual.widgets.data_table import ColumnDoesNotExist, ColumnKey, RowDoesNotExist, RowKey
@@
         for col_name, cell_value in zip(self.COLUMNS, row_cells, strict=True):
             try:
                 self.data_table.update_cell(  # type: ignore
                     row_key, self._column_keys[col_name], cell_value, update_width=True
                 )
-            except Exception as e:
+            except (RowDoesNotExist, ColumnDoesNotExist) as e:
                 _logger.warning(
                     f"Error updating cell {col_name} with value {cell_value}: {e!r}"
                 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- aiperf/common/config/user_config.py (2 hunks)
- aiperf/common/enums/__init__.py (2 hunks)
- aiperf/common/enums/command_enums.py (1 hunks)
- aiperf/common/enums/message_enums.py (1 hunks)
- aiperf/common/enums/telemetry_enums.py (1 hunks)
- aiperf/common/hooks.py (2 hunks)
- aiperf/common/messages/__init__.py (2 hunks)
- aiperf/common/messages/telemetry_messages.py (3 hunks)
- aiperf/common/mixins/__init__.py (2 hunks)
- aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1 hunks)
- aiperf/exporters/csv_exporter.py (0 hunks)
- aiperf/exporters/json_exporter.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_data_collector.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
- aiperf/post_processors/telemetry_results_processor.py (3 hunks)
- aiperf/records/records_manager.py (10 hunks)
- aiperf/ui/__init__.py (3 hunks)
- aiperf/ui/base_ui.py (1 hunks)
- aiperf/ui/dashboard/__init__.py (2 hunks)
- aiperf/ui/dashboard/aiperf_dashboard_ui.py (1 hunks)
- aiperf/ui/dashboard/aiperf_textual_app.py (7 hunks)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1 hunks)
- docs/tutorials/gpu-telemetry.md (5 hunks)
- tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
- tests/gpu_telemetry/test_telemetry_manager.py (13 hunks)
- tests/post_processors/test_telemetry_results_processor.py (1 hunks)
- tests/records/test_records_filtering.py (2 hunks)
💤 Files with no reviewable changes (3)
- aiperf/exporters/csv_exporter.py
- aiperf/exporters/json_exporter.py
- aiperf/gpu_telemetry/telemetry_data_collector.py
🧰 Additional context used
🧬 Code graph analysis (19)
aiperf/ui/dashboard/__init__.py (1)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  - GPUMetricsTable (21-154)
  - RealtimeTelemetryDashboard (248-313)
  - SingleNodeView (157-245)

aiperf/ui/dashboard/aiperf_dashboard_ui.py (4)
- aiperf/common/mixins/hooks_mixin.py (1)
  - attach_hook (96-114)
- aiperf/common/hooks.py (2)
  - AIPerfHook (41-57)
  - on_realtime_telemetry_metrics (352-364)
- aiperf/ui/dashboard/aiperf_textual_app.py (1)
  - on_realtime_telemetry_metrics (273-277)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
  - on_realtime_telemetry_metrics (304-313)

tests/post_processors/test_telemetry_results_processor.py (3)
- tests/data_exporters/conftest.py (1)
  - sample_telemetry_record (17-38)
- aiperf/common/models/telemetry_models.py (2)
  - TelemetryRecord (69-107)
  - TelemetryHierarchy (270-320)
- aiperf/post_processors/telemetry_results_processor.py (3)
  - TelemetryResultsProcessor (23-99)
  - process_telemetry_record (39-45)
  - get_telemetry_hierarchy (31-37)

aiperf/post_processors/telemetry_results_processor.py (3)
- aiperf/exporters/display_units_utils.py (1)
  - normalize_endpoint_display (17-36)
- aiperf/common/models/telemetry_models.py (4)
  - TelemetryHierarchy (270-320)
  - TelemetryRecord (69-107)
  - add_record (240-251)
  - add_record (293-320)
- aiperf/common/models/record_models.py (1)
  - MetricResult (25-51)

aiperf/common/enums/telemetry_enums.py (1)
- aiperf/common/enums/base_enums.py (1)
  - CaseInsensitiveStrEnum (10-49)

aiperf/common/enums/__init__.py (1)
- aiperf/common/enums/telemetry_enums.py (1)
  - GPUTelemetryMode (7-11)

aiperf/ui/base_ui.py (2)
- aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
  - RealtimeTelemetryMetricsMixin (15-43)
- aiperf/common/mixins/realtime_metrics_mixin.py (1)
  - RealtimeMetricsMixin (15-34)

aiperf/gpu_telemetry/telemetry_manager.py (1)
- aiperf/common/messages/telemetry_messages.py (1)
  - TelemetryRecordsMessage (20-45)

aiperf/common/config/user_config.py (2)
- aiperf/common/enums/telemetry_enums.py (1)
  - GPUTelemetryMode (7-11)
- aiperf/common/config/cli_parameter.py (1)
  - DisableCLI (22-29)

aiperf/ui/__init__.py (2)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  - GPUMetricsTable (21-154)
  - RealtimeTelemetryDashboard (248-313)
  - SingleNodeView (157-245)
- aiperf/ui/dashboard/custom_widgets.py (2)
  - MaximizableWidget (15-32)
  - NonFocusableDataTable (10-12)

aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (6)
- aiperf/common/enums/message_enums.py (1)
  - MessageType (7-51)
- aiperf/common/hooks.py (3)
  - AIPerfHook (41-57)
  - on_message (313-334)
  - provides_hooks (206-228)
- aiperf/common/messages/telemetry_messages.py (1)
  - RealtimeTelemetryMetricsMessage (79-86)
- aiperf/common/mixins/message_bus_mixin.py (1)
  - MessageBusClientMixin (34-141)
- aiperf/common/models/record_models.py (1)
  - MetricResult (25-51)
- aiperf/common/aiperf_logger.py (1)
  - AIPerfLogger (25-239)

aiperf/common/messages/__init__.py (1)
- aiperf/common/messages/telemetry_messages.py (2)
  - RealtimeTelemetryMetricsMessage (79-86)
  - StartRealtimeTelemetryCommand (89-98)

aiperf/common/hooks.py (2)
- aiperf/ui/dashboard/aiperf_textual_app.py (1)
  - on_realtime_telemetry_metrics (273-277)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
  - on_realtime_telemetry_metrics (304-313)

aiperf/records/records_manager.py (7)
- aiperf/common/enums/telemetry_enums.py (1)
  - GPUTelemetryMode (7-11)
- aiperf/common/messages/telemetry_messages.py (2)
  - RealtimeTelemetryMetricsMessage (79-86)
  - StartRealtimeTelemetryCommand (89-98)
- aiperf/exporters/display_units_utils.py (1)
  - normalize_endpoint_display (17-36)
- aiperf/common/models/error_models.py (1)
  - ErrorDetails (11-47)
- aiperf/gpu_telemetry/telemetry_manager.py (2)
  - TelemetryManager (40-378)
  - _normalize_dcgm_url (106-118)
- aiperf/post_processors/telemetry_results_processor.py (2)
  - summarize (47-99)
  - get_telemetry_hierarchy (31-37)
- aiperf/common/models/telemetry_models.py (1)
  - TelemetryResults (323-348)

aiperf/ui/dashboard/aiperf_textual_app.py (4)
- aiperf/common/enums/telemetry_enums.py (1)
  - GPUTelemetryMode (7-11)
- aiperf/common/messages/telemetry_messages.py (1)
  - StartRealtimeTelemetryCommand (89-98)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  - RealtimeTelemetryDashboard (248-313)
  - set_status_message (292-302)
  - on_realtime_telemetry_metrics (304-313)
- aiperf/common/hooks.py (1)
  - on_realtime_telemetry_metrics (352-364)

tests/gpu_telemetry/test_telemetry_manager.py (1)
- tests/data_exporters/test_gpu_telemetry_console_exporter.py (1)
  - mock_user_config (29-33)

aiperf/common/mixins/__init__.py (1)
- aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
  - RealtimeTelemetryMetricsMixin (15-43)

aiperf/common/messages/telemetry_messages.py (7)
- aiperf/common/enums/command_enums.py (1)
  - CommandType (7-17)
- aiperf/common/enums/telemetry_enums.py (1)
  - GPUTelemetryMode (7-11)
- aiperf/common/enums/message_enums.py (1)
  - MessageType (7-51)
- aiperf/common/messages/command_messages.py (1)
  - CommandMessage (57-97)
- aiperf/common/models/error_models.py (1)
  - ErrorDetails (11-47)
- aiperf/common/models/record_models.py (1)
  - MetricResult (25-51)
- aiperf/common/models/telemetry_models.py (2)
  - ProcessTelemetryResult (351-362)
  - TelemetryRecord (69-107)

aiperf/ui/dashboard/realtime_telemetry_dashboard.py (5)
- aiperf/common/aiperf_logger.py (1)
  - AIPerfLogger (25-239)
- aiperf/common/config/service_config.py (1)
  - ServiceConfig (30-171)
- aiperf/common/models/record_models.py (1)
  - MetricResult (25-51)
- aiperf/ui/dashboard/custom_widgets.py (2)
  - MaximizableWidget (15-32)
  - NonFocusableDataTable (10-12)
- aiperf/ui/dashboard/aiperf_textual_app.py (2)
  - compose (115-148)
  - on_realtime_telemetry_metrics (273-277)
🪛 Ruff (0.14.0)
aiperf/common/config/user_config.py
244-244: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py
41-41: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
42-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
117-117: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (11)
docs/tutorials/gpu-telemetry.md (3)
37-38: Well-documented default endpoint behavior. The IMPORTANT note clearly explains that the default endpoint `http://localhost:9401/metrics` is always attempted regardless of the `--gpu-telemetry` flag. This is well-emphasized and reinforced again in the multi-node example (lines 294, 316). The clarification is critical to user understanding and is properly positioned.
180-206: Port forwarding and DCGM exporter configuration is clearly explained. The setup instructions for the DCGM exporter are well-structured with clear explanations of port forwarding (`-p 9401:9400`), metrics collection interval, and custom metrics file mounting. The example commands are syntactically correct and production-ready.
31-35: Table effectively communicates flag behavior. The comparison table succinctly explains the differences between no flag, flag only, and custom URLs, making it easy for users to understand which endpoints are collected and displayed in each scenario.
aiperf/ui/dashboard/__init__.py (1)
34-38: LGTM! The new dashboard components (GPUMetricsTable, RealtimeTelemetryDashboard, SingleNodeView) are correctly imported and exported. The changes follow the standard mkinit pattern for exposing public API surface.
Also applies to: 55-55, 63-63, 65-65
aiperf/ui/__init__.py (1)
20-20: LGTM! The telemetry dashboard components are correctly re-exported from the dashboard submodule to the top-level UI package API. This follows the standard mkinit-generated pattern.
Also applies to: 28-28, 30-30, 52-52, 62-62, 64-64
aiperf/common/enums/command_enums.py (1)
17-17: LGTM! The new START_REALTIME_TELEMETRY command type follows the existing naming conventions and is properly integrated into the CommandType enum.
aiperf/ui/base_ui.py (1)
7-7: LGTM! The RealtimeTelemetryMetricsMixin is correctly integrated into the BaseAIPerfUI class hierarchy. The mixin follows the same pattern as the existing RealtimeMetricsMixin and provides the necessary hooks for handling real-time telemetry metrics. The docstring updates accurately reflect the new capability.
Also applies to: 12-17, 21-23, 25-27
aiperf/common/enums/message_enums.py (1)
44-44: LGTM! The new REALTIME_TELEMETRY_METRICS message type is correctly added to the MessageType enum, following the established naming conventions and alphabetical ordering.
aiperf/ui/dashboard/aiperf_dashboard_ui.py (1)
73-76: LGTM! The ON_REALTIME_TELEMETRY_METRICS hook attachment follows the established pattern for hook registration in the dashboard UI. The integration correctly wires the telemetry metrics flow from the mixin to the textual app's handler.
aiperf/common/mixins/__init__.py (1)
44-46: LGTM! The RealtimeTelemetryMetricsMixin is correctly imported and exported following the standard mkinit pattern. The alphabetical ordering is maintained.
Also applies to: 69-69
tests/records/test_records_filtering.py (1)
352-358: LGTM! The test updates correctly reflect the new `dcgm_url` parameter added to `TelemetryRecordsMessage`. The test values are consistent and appropriate for the testing scenarios. Also applies to: 386-392
Pull Request Overview
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution, introducing a new interactive TUI dashboard that displays live GPU metrics (utilization, memory, temperature, power).
- Real-time GPU telemetry dashboard with --gpu-telemetry 'dashboard' flag
- New UI components and data streaming infrastructure for telemetry visualization
- Enhanced telemetry management with endpoint tracking and performance optimization
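The `--gpu-telemetry 'dashboard'` flag described above implies a case-insensitive mode enum. A minimal sketch of how such a flag could be parsed; the member names and values below are assumed from the PR summary and may differ from aiperf's real GPUTelemetryMode:

```python
from enum import Enum

class GPUTelemetryMode(str, Enum):
    """Hypothetical telemetry display modes (names assumed, not aiperf's actual enum)."""
    SUMMARY = "summary"
    REALTIME_DASHBOARD = "dashboard"

    @classmethod
    def _missing_(cls, value):
        # Accept any casing on the CLI: "Dashboard", "DASHBOARD", ...
        if isinstance(value, str):
            lowered = value.lower()
            for member in cls:
                if member.value == lowered:
                    return member
        return None

print(GPUTelemetryMode("Dashboard") is GPUTelemetryMode.REALTIME_DASHBOARD)  # -> True
```

`_missing_` is the standard Enum hook for fallback lookup, so `GPUTelemetryMode("DASHBOARD")` resolves cleanly instead of raising `ValueError`.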
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| aiperf/ui/dashboard/realtime_telemetry_dashboard.py | New interactive TUI dashboard for real-time GPU metrics visualization |
| aiperf/ui/dashboard/aiperf_textual_app.py | Integrates telemetry dashboard into main UI with mode switching |
| aiperf/records/records_manager.py | Adds telemetry data streaming and background task management |
| aiperf/post_processors/telemetry_results_processor.py | Enhances telemetry processor with hierarchy access and display formatting |
| aiperf/gpu_telemetry/telemetry_manager.py | Updates telemetry manager with URL tracking and message enrichment |
| aiperf/common/config/user_config.py | Adds GPU telemetry mode configuration parsing |
| tests/ | Updates test files with new dcgm_url field and configuration changes |
Force-pushed d24a8fd to 0cd8551
Force-pushed 7911183 to c17e2ae
Force-pushed 0453e7f to 47a2e8e
Halfway through, but it looks really well done so far! Great work!
Force-pushed 47a2e8e to ecf615b
Force-pushed e56a660 to bdc253d
…messages in telemetry manager
Force-pushed bdc253d to 7929399
Looks really good! Have some questions and comments, not sure much anything needs to change though. Great work!
```python
if gpu_key not in gpus:
    gpus[gpu_key] = []
gpus[gpu_key].append(metric)
```

Suggested change:

```python
gpus.setdefault(gpu_key, []).append(metric)
```
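Both `setdefault` and `collections.defaultdict` express this grouping idiom; the sample metric dicts below are made up for illustration:

```python
from collections import defaultdict

metrics = [
    {"gpu": "gpu-0", "tag": "utilization"},
    {"gpu": "gpu-0", "tag": "power"},
    {"gpu": "gpu-1", "tag": "utilization"},
]

# setdefault: one expression replaces the check-then-append pattern.
gpus: dict[str, list[dict]] = {}
for metric in metrics:
    gpus.setdefault(metric["gpu"], []).append(metric)

# defaultdict achieves the same and avoids allocating a fresh empty
# list on every call when the key already exists.
gpus2: defaultdict[str, list[dict]] = defaultdict(list)
for metric in metrics:
    gpus2[metric["gpu"]].append(metric)

print(sorted(gpus))  # -> ['gpu-0', 'gpu-1']
```

`defaultdict` is the better fit when every access should auto-create the group; `setdefault` keeps the plain-dict type, which matters if the mapping is later serialized or compared against a regular dict.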
```python
if not self._columns_initialized:
    self._initialize_columns()

gpu_metrics = [m for m in metrics if self._is_for_this_gpu(m)]
```
this is already being filtered by the caller, isn't it?
```python
async def _generate_realtime_telemetry_metrics(self) -> list[MetricResult]:
    """Generate the real-time GPU telemetry metrics."""
    telemetry_results = await asyncio.gather(
        *[
            results_processor.summarize()
```
hmm, Ok, I see why it is not being sent as structured data (as per some of my below comments will talk about). So you may be limited in your ability to easily change things.
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution.
Key Additions:
Technical Changes
Design Decisions
Performance Analysis:
Rationale: Trading 0.1% CPU overhead for eliminating 3-5 second wait times provides significantly better UX. The performance cost is negligible while the user experience benefit is substantial.
Documentation
Example:

Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation