Open
Labels: type:enhancement (New feature or request)
Description
Overview
This is a tracking issue for implementing node-level total power monitoring across all supported hardware platforms in all-smi. The goal is to provide users with visibility into the total power consumption of each compute node, including chassis-level power where accessible.
Motivation
- Power Efficiency Monitoring: Track and optimize power usage across GPU/NPU clusters
- Capacity Planning: Understand power requirements for infrastructure scaling
- Thermal Management: Correlate power consumption with thermal conditions
- Cost Estimation: Enable accurate power cost calculations for workloads
- Environmental Impact: Support sustainability reporting and carbon footprint tracking
Scope
Platforms to Support
| Platform | Power Source | Notes |
|---|---|---|
| Apple Silicon | powermetrics | Combined Power (CPU + GPU + ANE) |
| NVIDIA | nvidia-smi | GPU power; IPMI/BMC for chassis |
| Intel Gaudi | hl-smi | NPU power consumption |
| Tenstorrent | luwen API | NPU power metrics |
| Rebellions | rbln-stat | NPU power consumption |
| Furiosa | furiosa-smi | NPU power metrics |
| x86 Servers | IPMI/BMC/RAPL | Chassis-level or CPU power |
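As an illustration of the component-level path for one row of the table, here is a minimal Python sketch that sums per-GPU power from `nvidia-smi`. It assumes the query `--query-gpu=power.draw --format=csv,noheader,nounits`, which prints one reading in watts per GPU. The function names are hypothetical and not part of all-smi.

```python
import subprocess
from typing import List, Optional

# Assumed query: one power reading (watts) per line, no header, no unit suffix.
QUERY = ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]


def parse_power_draw(output: str) -> List[Optional[float]]:
    """Parse one reading per line; non-numeric entries such as "[N/A]" become None."""
    readings: List[Optional[float]] = []
    for line in output.splitlines():
        text = line.strip()
        if not text:
            continue
        try:
            readings.append(float(text))
        except ValueError:
            readings.append(None)
    return readings


def sum_gpu_power_watts(output: str) -> Optional[float]:
    """Sum the GPUs that reported a value; return None if none did."""
    values = [r for r in parse_power_draw(output) if r is not None]
    return sum(values) if values else None


def query_gpu_power_watts() -> Optional[float]:
    """Run nvidia-smi and aggregate; return None if the tool is missing or fails."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return sum_gpu_power_watts(result.stdout)
```

Keeping the parsing separate from the subprocess call lets the aggregation be tested on captured output from machines without a GPU.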
Implementation Approach
- Platform-Specific Collection
  - Use native tools for each platform (powermetrics, nvidia-smi, hl-smi, etc.)
  - For server-class hardware, explore IPMI/BMC interfaces for chassis power
  - Fall back to component-level aggregation where chassis power is unavailable
- Data Model
  - Add a `total_power_watts` field to node-level metrics
  - Distinguish between chassis power (if available) and component sum
  - Include power source metadata in API responses
- Display & Export
  - Show total power in TUI view mode
  - Export as Prometheus metric `all_smi_node_total_power_watts`
  - Include power efficiency metrics where applicable
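The fallback and export steps above could be sketched roughly as follows, in Python for brevity. Only `total_power_watts` and the metric name `all_smi_node_total_power_watts` come from this issue. The `NodePower` type, the `power_source` values `"chassis"` and `"component_sum"`, and the `host` and `source` labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodePower:
    total_power_watts: float
    power_source: str  # hypothetical values: "chassis" or "component_sum"


def resolve_node_power(
    chassis_watts: Optional[float], component_watts: Iterable[float]
) -> Optional[NodePower]:
    """Prefer a chassis reading; otherwise fall back to summing components."""
    if chassis_watts is not None:
        return NodePower(chassis_watts, "chassis")
    components = list(component_watts)
    if components:
        return NodePower(sum(components), "component_sum")
    return None  # no power data available for this node


def to_prometheus(node: NodePower, host: str) -> str:
    """Render one gauge sample in the Prometheus text exposition format.

    Label names are assumptions; label values are assumed not to need escaping.
    """
    name = "all_smi_node_total_power_watts"
    return (
        f"# HELP {name} Total node power in watts\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{host="{host}",source="{node.power_source}"}} '
        f"{node.total_power_watts}\n"
    )


if __name__ == "__main__":
    # No chassis reading, so the component sum is used.
    node = resolve_node_power(None, [250.0, 250.0, 65.5])
    if node is not None:
        print(to_prometheus(node, "node01"), end="")
```

Carrying the source alongside the value lets consumers tell a true chassis reading apart from a GPU/NPU-only sum, which would otherwise understate node power.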
Limitations
- Some platforms may only provide component-level power (GPU/NPU only)
- Chassis-level power requires IPMI/BMC access, which often needs elevated privileges
- Power accuracy varies by platform and sensor quality
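To make the privilege and component-only limitations concrete for the RAPL case, here is a hedged Python sketch. On Linux, RAPL is exposed through the powercap sysfs interface as a cumulative energy counter in microjoules (`energy_uj`), so power has to be derived from two samples. The domain path `intel-rapl:0` and the function names are assumptions, and the wraparound handling is an approximation based on `max_energy_range_uj`.

```python
import time
from pathlib import Path
from typing import Optional

# Assumed location of the first RAPL package domain; real systems may differ.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name: str) -> Optional[int]:
    """Read an integer sysfs value; return None if it is missing or unreadable."""
    try:
        return int((RAPL_DOMAIN / name).read_text().strip())
    except (OSError, ValueError):
        return None  # covers missing files and permission errors


def average_watts(e0_uj: int, e1_uj: int, elapsed_s: float, max_range_uj: int) -> float:
    """Average power between two cumulative energy samples, allowing one wrap."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:
        delta_uj += max_range_uj  # the counter wrapped once between samples
    return delta_uj / 1_000_000 / elapsed_s


def sample_cpu_package_watts(interval_s: float = 1.0) -> Optional[float]:
    """Sample the counter twice; return None when RAPL cannot be read."""
    max_range = read_uj("max_energy_range_uj")
    e0 = read_uj("energy_uj")
    if max_range is None or e0 is None:
        return None
    t0 = time.monotonic()
    time.sleep(interval_s)
    e1 = read_uj("energy_uj")
    if e1 is None:
        return None
    return average_watts(e0, e1, time.monotonic() - t0, max_range)
```

On many recent kernels `energy_uj` is readable only by root, so `sample_cpu_package_watts` returning `None` is an expected outcome for unprivileged users. The value also covers only the CPU package, not the whole chassis.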
Sub-issues
Implementation will be tracked in platform-specific sub-issues:
- #82 - feat: macOS Apple Silicon - Total Power monitoring via powermetrics
- #84 - feat: Add Chassis/Node-level monitoring with per-node power and BMC metrics
- NVIDIA - GPU Power aggregation
- Intel Gaudi - NPU Power monitoring
- Other platforms (as needed)
This is a tracking/meta issue. Individual implementation details will be in linked sub-issues.