feat: Node Total Power Monitoring - Track system-wide power consumption per compute node #81

@inureyes

Overview

This is a tracking issue for implementing node-level total power monitoring across all supported hardware platforms in all-smi. The goal is to provide users with visibility into the total power consumption of each compute node, including chassis-level power where accessible.

Motivation

  • Power Efficiency Monitoring: Track and optimize power usage across GPU/NPU clusters
  • Capacity Planning: Understand power requirements for infrastructure scaling
  • Thermal Management: Correlate power consumption with thermal conditions
  • Cost Estimation: Enable accurate power cost calculations for workloads
  • Environmental Impact: Support sustainability reporting and carbon footprint tracking

Scope

Platforms to Support

| Platform | Power Source | Notes |
| --- | --- | --- |
| Apple Silicon | powermetrics | Combined power (CPU + GPU + ANE) |
| NVIDIA | nvidia-smi | GPU power; IPMI/BMC for chassis |
| Intel Gaudi | hl-smi | NPU power consumption |
| Tenstorrent | luwen API | NPU power metrics |
| Rebellions | rbln-stat | NPU power consumption |
| Furiosa | furiosa-smi | NPU power metrics |
| x86 Servers | IPMI/BMC/RAPL | Chassis-level or CPU power |
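
For the RAPL row, a minimal Python sketch of how CPU package power could be derived on Linux. It is an illustration only, not all-smi's implementation. RAPL exposes a cumulative energy counter in microjoules rather than a wattage, so power has to be computed from two samples. The sysfs path below is the usual powercap location for package 0 but is an assumption for any given machine, and the function names are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

# Usual Linux powercap location for CPU package 0 (assumption: varies by system).
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def watts_from_energy(e0_uj: int, e1_uj: int, dt_s: float, max_range_uj: int) -> float:
    """Average power between two energy-counter samples, allowing one wraparound."""
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:
        # The counter wrapped past max_energy_range_uj between the two samples.
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / dt_s


def read_rapl_watts(domain: Path = RAPL_DOMAIN, interval_s: float = 1.0) -> Optional[float]:
    """Sample energy_uj twice; return None if RAPL is missing or unreadable."""
    try:
        max_range = int((domain / "max_energy_range_uj").read_text())
        e0 = int((domain / "energy_uj").read_text())
        t0 = time.monotonic()
        time.sleep(interval_s)
        e1 = int((domain / "energy_uj").read_text())
        dt = time.monotonic() - t0
    except (OSError, ValueError):
        # Covers a missing file, insufficient permissions, or unparsable content.
        return None
    return watts_from_energy(e0, e1, dt, max_range)
```

Returning `None` instead of raising lets a caller treat RAPL as one optional source among several.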

Implementation Approach

  1. Platform-Specific Collection

    • Use native tools for each platform (powermetrics, nvidia-smi, hl-smi, etc.)
    • For server-class hardware, explore IPMI/BMC interfaces for chassis power
    • Fall back to component-level aggregation where chassis power is unavailable
  2. Data Model

    • Add total_power_watts field to node-level metrics
    • Distinguish between chassis power (if available) and component sum
    • Include power source metadata in API responses
  3. Display & Export

    • Show total power in TUI view mode
    • Export as Prometheus metric all_smi_node_total_power_watts
    • Include power efficiency metrics where applicable
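
The three steps above can be sketched in Python as follows. This is a hedged illustration, not the project's actual data model: only `total_power_watts` and the metric name `all_smi_node_total_power_watts` come from this issue, while `PowerSource`, `NodePower`, `node_total_power`, `to_prometheus`, and the `node` and `source` labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class PowerSource(Enum):
    CHASSIS = "chassis"              # whole-node reading, e.g. from IPMI/BMC
    COMPONENT_SUM = "component_sum"  # sum of per-device readings (GPU/NPU/CPU)


@dataclass(frozen=True)
class NodePower:
    total_power_watts: float
    source: PowerSource


def node_total_power(
    chassis_watts: Optional[float], component_watts: Sequence[float]
) -> Optional[NodePower]:
    """Prefer a chassis reading; otherwise fall back to summing components."""
    if chassis_watts is not None:
        return NodePower(chassis_watts, PowerSource.CHASSIS)
    if component_watts:
        return NodePower(sum(component_watts), PowerSource.COMPONENT_SUM)
    return None  # no power data available for this node


def to_prometheus(node: str, power: NodePower) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    name = "all_smi_node_total_power_watts"
    return (
        f"# HELP {name} Total power draw of the node in watts\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{node="{node}",source="{power.source.value}"}} '
        f"{power.total_power_watts}\n"
    )


if __name__ == "__main__":
    reading = node_total_power(None, [250.0, 150.5])
    if reading is not None:
        print(to_prometheus("node-01", reading), end="")
```

Carrying the source as a label keeps chassis readings and component sums distinguishable in dashboards, since a component sum will generally undercount the true node draw.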

Limitations

  • Some platforms may only provide component-level power (GPU/NPU only)
  • Chassis-level power requires IPMI/BMC access, which often needs elevated privileges
  • Power accuracy varies by platform and sensor quality
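
One way to cope with the privilege limitation is to attempt the chassis reading and treat any failure as "unavailable". The Python sketch below assumes `ipmitool` is installed and that the BMC supports DCMI. The parsed line format ("Instantaneous power reading: N Watts") matches common `ipmitool dcmi power reading` output but should be verified on the target hardware. The function names are hypothetical.

```python
import re
import subprocess
from typing import Optional

# Assumed output line: "Instantaneous power reading:      228 Watts"
_READING = re.compile(r"Instantaneous power reading:\s*(\d+)\s*Watts")


def parse_dcmi_power(output: str) -> Optional[int]:
    """Extract the instantaneous reading in watts, or None if it is absent."""
    match = _READING.search(output)
    return int(match.group(1)) if match else None


def read_chassis_watts(timeout_s: float = 5.0) -> Optional[int]:
    """Return chassis power in watts, or None when IPMI cannot be queried."""
    try:
        result = subprocess.run(
            ["ipmitool", "dcmi", "power", "reading"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # ipmitool missing, permission denied, non-zero exit, or timeout.
        return None
    return parse_dcmi_power(result.stdout)
```

A `None` result here is the signal to fall back to component-level aggregation.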

Sub-issues

Implementation will be tracked in platform-specific sub-issues:


This is a tracking/meta issue. Individual implementation details will be in linked sub-issues.
