feat: Add support for DGX Spark (GB10) and Unified Memory Architecture NVIDIA GPUs #80

@inureyes

Problem / Background

NVIDIA has released DGX Spark, a desktop AI system based on the GB10 Grace Blackwell chip. This system uses Unified Memory Architecture (UMA) where CPU and GPU share the same physical memory, which is fundamentally different from traditional discrete GPUs with dedicated VRAM.

The current NVIDIA GPU monitoring in all-smi assumes discrete GPUs with:

  • Dedicated GPU memory (VRAM) separate from system RAM
  • device.memory_info() returning GPU-specific memory metrics
  • Clear distinction between used_memory and total_memory for the GPU

On UMA systems like DGX Spark:

  • CPU and GPU share the same physical memory pool
  • Traditional memory reporting concepts may not apply directly
  • NVML may report memory differently or require different API calls
  • Memory usage attribution between CPU and GPU workloads may differ

Affected Products

  • NVIDIA DGX Spark (GB10 Grace Blackwell)
  • Future Grace-based products with unified memory
  • Similar architectures that NVIDIA may release

Proposed Solution

Phase 1: Investigation

  1. Research NVML behavior on UMA systems

    • Determine how nvmlDeviceGetMemoryInfo() behaves on GB10
    • Check if new NVML APIs exist for unified memory reporting
    • Investigate nvmlDeviceGetMemoryInfo_v2() and related functions
  2. Identify detection mechanism

    • Determine how to detect whether a GPU uses UMA or discrete memory
    • Check device properties, architecture flags, or memory type indicators
  3. Review existing implementations

    • Reference the nvidia_jetson.rs implementation which already handles integrated GPUs with shared memory
    • Consider patterns from Apple Silicon support where unified memory is used
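
Until NVML behavior on GB10 is confirmed, one possible starting point for the detection question is a name-based heuristic. The sketch below is only an illustration: the function `is_likely_uma`, the `UMA_NAME_HINTS` list, and the assumption that the reported device name contains "GB10" are all hypothetical and need to be verified on real hardware.

```rust
/// Hypothetical name fragments that suggest a unified-memory NVIDIA device.
/// "GB10" is an assumption about the reported device name, not a verified value.
const UMA_NAME_HINTS: &[&str] = &["GB10"];

/// Returns true when the device name contains one of the UMA hints.
/// The comparison is case-insensitive.
fn is_likely_uma(device_name: &str) -> bool {
    let upper = device_name.to_ascii_uppercase();
    UMA_NAME_HINTS.iter().any(|hint| upper.contains(*hint))
}
```

A name match is a weak signal, so it would probably be combined with whatever architecture or memory-type indicator the investigation turns up.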

Phase 2: Implementation

  1. Add UMA detection logic

    • Detect Grace Blackwell and similar UMA architectures
    • Add appropriate flags/metadata to distinguish UMA devices
  2. Implement appropriate memory reporting

    • Handle shared memory pool reporting
    • Consider adding new fields like shared_memory or unified_memory_total
    • Ensure used_memory and total_memory remain meaningful
  3. Update device details

    • Add "Memory Type: Unified" or similar indicator
    • Report relevant UMA-specific metrics if available
  4. Handle edge cases

    • Graceful fallback if NVML doesn't support certain queries
    • Consistent behavior across different driver versions
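
The fallback behavior described in the steps above could be sketched as a small selection function that is independent of NVML itself. All names here (`MemoryReading`, `MemorySource`, `resolve_memory`) are hypothetical, and the NVML result is modeled as a plain `Result` with a string error so the example stays self-contained. Treating a zero total as "unusable" is also an assumption.

```rust
/// Used and total memory in bytes, from whichever source produced them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MemoryReading {
    used_bytes: u64,
    total_bytes: u64,
}

/// Where the final reading came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemorySource {
    Nvml,
    SystemFallback,
    Unavailable,
}

/// Prefer the NVML reading; fall back to system memory when the NVML query
/// failed or reported a zero total (assumed here to mean "not meaningful").
fn resolve_memory(
    nvml: Result<MemoryReading, String>,
    system: Option<MemoryReading>,
) -> (MemorySource, Option<MemoryReading>) {
    // Log a warning for unsupported or failed queries.
    if let Err(reason) = &nvml {
        eprintln!("warning: NVML memory query failed: {reason}");
    }
    match (nvml, system) {
        (Ok(reading), _) if reading.total_bytes > 0 => (MemorySource::Nvml, Some(reading)),
        (_, Some(fallback)) => (MemorySource::SystemFallback, Some(fallback)),
        _ => (MemorySource::Unavailable, None),
    }
}
```

Returning the source alongside the reading lets the UI show where the numbers came from, which may matter when CPU and GPU usage cannot be separated.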

Acceptance Criteria

  • Document NVML behavior on DGX Spark / GB10 systems
  • Implement detection for UMA-based NVIDIA GPUs
  • Memory metrics are reported accurately and meaningfully for UMA systems
  • Device details include memory architecture type (Discrete/Unified)
  • No regression in existing discrete GPU support
  • Unit tests cover UMA detection and reporting logic
  • Documentation updated with UMA-specific notes

Technical Considerations

NVML API Research Areas

  • nvmlDeviceGetMemoryInfo() vs nvmlDeviceGetMemoryInfo_v2()
  • nvmlDeviceGetArchitecture() - check for Blackwell identification (Grace is the CPU side of GB10, so it is unlikely to appear as a GPU architecture value)
  • nvmlDeviceGetBrand() - may indicate DGX Spark
  • Memory bus type and width queries

Architecture Reference

Current relevant implementations:

  • /src/device/readers/nvidia.rs - Standard NVIDIA GPU reader using NVML
  • /src/device/readers/nvidia_jetson.rs - Jetson reader handling integrated GPU with shared memory (uses tegrastats fallback)

Potential New Fields in GpuInfo

// Consider adding to device detail or as new fields
memory_type: Option<String>,  // "Discrete", "Unified", "Shared"
unified_memory_total: Option<u64>,  // Total unified memory pool
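
As an alternative to a free-form string, the memory type could be modeled as an enum so that invalid values cannot be constructed. This is a sketch only: `MemoryType`, `MemoryFields`, and `memory_type_detail` are hypothetical names, and the real GpuInfo struct is not reproduced here.

```rust
use std::fmt;

/// Hypothetical typed replacement for a `memory_type: Option<String>` field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryType {
    Discrete,
    Unified,
    Shared,
}

impl fmt::Display for MemoryType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self {
            MemoryType::Discrete => "Discrete",
            MemoryType::Unified => "Unified",
            MemoryType::Shared => "Shared",
        };
        f.write_str(label)
    }
}

/// Stand-in for the proposed new fields; not the actual GpuInfo definition.
#[derive(Debug, Default)]
struct MemoryFields {
    memory_type: Option<MemoryType>,
    unified_memory_total: Option<u64>, // total unified pool in bytes
}

/// Builds the "Memory Type: ..." line for the device details view.
fn memory_type_detail(fields: &MemoryFields) -> String {
    match fields.memory_type {
        Some(kind) => format!("Memory Type: {kind}"),
        None => "Memory Type: Unknown".to_string(),
    }
}
```

If the value has to be serialized as a string for existing consumers, the `Display` implementation gives the same labels as the comment in the proposed fields.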

Graceful Degradation

If NVML on UMA systems doesn't provide expected metrics:

  1. Fall back to system memory reporting (similar to Jetson approach)
  2. Use /proc/meminfo or similar for unified memory systems
  3. Log warnings for unsupported queries
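
For the /proc/meminfo fallback, a minimal parser might look like the sketch below. `SystemMemory`, `parse_meminfo`, and `read_system_memory` are hypothetical names. The code assumes the usual Linux layout in which `MemTotal` and `MemAvailable` are reported in kB (units of 1024 bytes).

```rust
/// System-wide memory in bytes, as read from /proc/meminfo.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SystemMemory {
    total_bytes: u64,
    available_bytes: u64,
}

/// Parses /proc/meminfo-style text. Returns None if MemTotal or
/// MemAvailable is missing or cannot be converted to bytes.
fn parse_meminfo(text: &str) -> Option<SystemMemory> {
    let mut total_kib = None;
    let mut available_kib = None;
    for line in text.lines() {
        let mut parts = line.split_whitespace();
        // Each line looks like "MemTotal:   16384 kB"; skip anything else.
        let (Some(key), Some(value)) = (parts.next(), parts.next()) else {
            continue;
        };
        let Ok(kib) = value.parse::<u64>() else {
            continue;
        };
        match key {
            "MemTotal:" => total_kib = Some(kib),
            "MemAvailable:" => available_kib = Some(kib),
            _ => {}
        }
    }
    Some(SystemMemory {
        total_bytes: total_kib?.checked_mul(1024)?,
        available_bytes: available_kib?.checked_mul(1024)?,
    })
}

/// Reads the live file; returns None on non-Linux systems or on read errors.
fn read_system_memory() -> Option<SystemMemory> {
    let text = std::fs::read_to_string("/proc/meminfo").ok()?;
    parse_meminfo(&text)
}
```

Used memory can then be approximated as `total_bytes - available_bytes`. On a unified pool this figure covers CPU and GPU workloads together.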

Additional Context

Related Implementations

  • NVIDIA Jetson (nvidia_jetson.rs): Uses tegrastats and system memory fallback for integrated GPU
  • Apple Silicon (apple.rs): Unified memory architecture with shared CPU/GPU memory pool

Hardware Access

  • Testing requires access to actual DGX Spark hardware or equivalent GB10 system
  • Consider adding mock device templates for testing without hardware
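
One way to test without hardware is to put the memory query behind a small trait and supply mock devices. The sketch below is hypothetical throughout: `GpuProbe`, `MockGpu`, the two template constructors, and the "not supported" error text are illustrative and do not reflect confirmed NVML behavior on GB10.

```rust
/// Minimal abstraction over the queries the reader needs.
trait GpuProbe {
    fn name(&self) -> String;
    /// Returns (used_bytes, total_bytes) or an error description.
    fn memory_info(&self) -> Result<(u64, u64), String>;
}

/// A canned device used in tests instead of real hardware.
struct MockGpu {
    name: &'static str,
    memory: Result<(u64, u64), &'static str>,
}

impl GpuProbe for MockGpu {
    fn name(&self) -> String {
        self.name.to_string()
    }
    fn memory_info(&self) -> Result<(u64, u64), String> {
        self.memory.map_err(|e| e.to_string())
    }
}

/// Template: a discrete GPU that reports dedicated memory normally.
fn mock_discrete() -> MockGpu {
    MockGpu { name: "Mock Discrete GPU", memory: Ok((1024, 4096)) }
}

/// Template: a UMA device whose memory query fails (assumed behavior).
fn mock_uma_unsupported() -> MockGpu {
    MockGpu { name: "Mock UMA GPU", memory: Err("not supported") }
}

/// Example consumer that works against either real or mock probes.
fn memory_summary(probe: &dyn GpuProbe) -> String {
    match probe.memory_info() {
        Ok((used, total)) => format!("{}: {used}/{total} bytes", probe.name()),
        Err(reason) => format!("{}: memory unavailable ({reason})", probe.name()),
    }
}
```

Once real GB10 behavior is documented, the UMA template can be updated to match it so the unit tests encode what was actually observed.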
