Open
Labels
device:nvidia-gpu, priority:medium, status:ready, type:enhancement
Description
Problem / Background
NVIDIA has released DGX Spark, a desktop AI system based on the GB10 Grace Blackwell chip. This system uses a Unified Memory Architecture (UMA) in which the CPU and GPU share the same physical memory. This is fundamentally different from traditional discrete GPUs with dedicated VRAM.
Current all-smi NVIDIA GPU monitoring assumes discrete GPUs with:
- Dedicated GPU memory (VRAM) separate from system RAM
- `device.memory_info()` returning GPU-specific memory metrics
- Clear distinction between `used_memory` and `total_memory` for the GPU
On UMA systems like DGX Spark:
- CPU and GPU share the same physical memory pool
- Traditional memory reporting concepts may not apply directly
- NVML may report memory differently or require different API calls
- Memory usage attribution between CPU and GPU workloads may differ
Affected Products
- NVIDIA DGX Spark (GB10 Grace Blackwell)
- Future Grace-based products with unified memory
- Similar architectures that NVIDIA may release
Proposed Solution
Phase 1: Investigation
- Research NVML behavior on UMA systems
  - Determine how `nvmlDeviceGetMemoryInfo()` behaves on GB10
  - Check if new NVML APIs exist for unified memory reporting
  - Investigate `nvmlDeviceGetMemoryInfo_v2()` and related functions
- Identify detection mechanism
  - How to detect if a GPU uses UMA vs discrete memory
  - Check device properties, architecture flags, or memory type indicators
- Review existing implementations
  - Reference the `nvidia_jetson.rs` implementation which already handles integrated GPUs with shared memory
  - Consider patterns from Apple Silicon support where unified memory is used
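One possible shape for the detection step, as a sketch only. How NVML actually behaves on GB10 is what Phase 1 is meant to establish, so both the inputs (a device name, an optional NVML-reported total, and the system RAM total) and the heuristics (a `GB10` name match, or an NVML total within 5% of system RAM) are unverified assumptions. `MemoryType` and `classify_memory_type` are hypothetical names, not existing all-smi code.

```rust
/// Hypothetical classification of a GPU's memory architecture.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryType {
    Discrete,
    Unified,
}

/// Guess the memory architecture from values a reader might already have.
///
/// Both heuristics are assumptions to validate on real hardware:
/// 1. the device name mentions GB10,
/// 2. the NVML-reported total is within 5% of total system RAM.
/// Without evidence for UMA, the function defaults to `Discrete` so that
/// existing discrete GPU behavior is unchanged.
pub fn classify_memory_type(
    device_name: &str,
    nvml_total: Option<u64>,
    system_total: u64,
) -> MemoryType {
    if device_name.to_ascii_uppercase().contains("GB10") {
        return MemoryType::Unified;
    }
    match nvml_total {
        Some(total) if system_total > 0 => {
            // Compare the two totals with a 5% tolerance (system_total / 20).
            if total.abs_diff(system_total) <= system_total / 20 {
                MemoryType::Unified
            } else {
                MemoryType::Discrete
            }
        }
        _ => MemoryType::Discrete,
    }
}
```

Defaulting to `Discrete` is a deliberately conservative choice. If the investigation finds a dedicated NVML attribute for UMA, that attribute should replace these heuristics.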
Phase 2: Implementation
- Add UMA detection logic
  - Detect Grace Blackwell and similar UMA architectures
  - Add appropriate flags/metadata to distinguish UMA devices
- Implement appropriate memory reporting
  - Handle shared memory pool reporting
  - Consider adding new fields like `shared_memory` or `unified_memory_total`
  - Ensure `used_memory` and `total_memory` remain meaningful
- Update device details
  - Add "Memory Type: Unified" or similar indicator
  - Report relevant UMA-specific metrics if available
- Handle edge cases
  - Graceful fallback if NVML doesn't support certain queries
  - Consistent behavior across different driver versions
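The "graceful fallback" edge case could be structured roughly as follows. This is a minimal sketch under stated assumptions: `MemorySnapshot`, `MemorySource`, and `resolve_memory` are hypothetical names, the NVML result is passed in as a plain `Result` rather than obtained from a real NVML call, and treating a zero total as "unsupported" is a guess about how a UMA device might respond.

```rust
use std::fmt::Display;

/// Hypothetical memory snapshot, in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemorySnapshot {
    pub used: u64,
    pub total: u64,
}

/// Records which source produced the reported numbers (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemorySource {
    Nvml,
    SystemFallback,
}

/// Prefer the NVML result; fall back to system memory only for UMA devices.
/// Returns `None` when neither source yields usable numbers.
pub fn resolve_memory<E: Display>(
    nvml: Result<MemorySnapshot, E>,
    system: Option<MemorySnapshot>,
    is_uma: bool,
) -> Option<(MemorySnapshot, MemorySource)> {
    let nvml_snapshot = match nvml {
        Ok(snapshot) if snapshot.total > 0 => Some(snapshot),
        // Assumption: a zero total means the query is not really supported.
        Ok(_) => None,
        Err(err) => {
            eprintln!("warning: NVML memory query failed: {}", err);
            None
        }
    };
    if let Some(snapshot) = nvml_snapshot {
        return Some((snapshot, MemorySource::Nvml));
    }
    if is_uma {
        // Only UMA devices may borrow the system-wide numbers.
        return system.map(|snapshot| (snapshot, MemorySource::SystemFallback));
    }
    None
}
```

Returning the source alongside the numbers lets the device details say where `used_memory` and `total_memory` came from, and discrete GPUs never silently pick up system RAM figures.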
Acceptance Criteria
- Document NVML behavior on DGX Spark / GB10 systems
- Implement detection for UMA-based NVIDIA GPUs
- Memory metrics are reported accurately and meaningfully for UMA systems
- Device details include memory architecture type (Discrete/Unified)
- No regression in existing discrete GPU support
- Unit tests cover UMA detection and reporting logic
- Documentation updated with UMA-specific notes
Technical Considerations
NVML API Research Areas
- `nvmlDeviceGetMemoryInfo()` vs `nvmlDeviceGetMemoryInfo_v2()`
- `nvmlDeviceGetArchitecture()` - check for Blackwell/Grace identification
- `nvmlDeviceGetBrand()` - may indicate DGX Spark
- Memory bus type and width queries
Architecture Reference
Current relevant implementations:
- `/src/device/readers/nvidia.rs` - Standard NVIDIA GPU reader using NVML
- `/src/device/readers/nvidia_jetson.rs` - Jetson reader handling integrated GPU with shared memory (uses tegrastats fallback)
Potential New Fields in GpuInfo
// Consider adding to device detail or as new fields
memory_type: Option<String>, // "Discrete", "Unified", "Shared"
unified_memory_total: Option<u64>, // Total unified memory pool
Graceful Degradation
If NVML on UMA systems doesn't provide expected metrics:
- Fall back to system memory reporting (similar to Jetson approach)
- Use `/proc/meminfo` or similar for unified memory systems
- Log warnings for unsupported queries
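The `/proc/meminfo` fallback might look something like the sketch below. It assumes "used" can be approximated as `MemTotal - MemAvailable`, which describes the whole shared pool and cannot attribute usage to the CPU or the GPU. `parse_meminfo` and `read_system_memory` are hypothetical helper names.

```rust
use std::fs;

/// Parse `MemTotal` and `MemAvailable` from /proc/meminfo-style text.
/// Returns `(total_bytes, used_bytes)`, with used = total - available.
/// The kernel reports these values in kB, so they are multiplied by 1024.
pub fn parse_meminfo(contents: &str) -> Option<(u64, u64)> {
    let mut total_kb: Option<u64> = None;
    let mut available_kb: Option<u64> = None;
    for line in contents.lines() {
        let mut parts = line.split_whitespace();
        let key = parts.next();
        let value = parts.next().and_then(|v| v.parse::<u64>().ok());
        match key {
            Some("MemTotal:") => total_kb = value,
            Some("MemAvailable:") => available_kb = value,
            _ => {}
        }
    }
    // Both fields are required; a missing one yields `None`.
    let total = total_kb?.checked_mul(1024)?;
    let available = available_kb?.checked_mul(1024)?;
    Some((total, total.saturating_sub(available)))
}

/// Read the live values. Returns `None` if the file is missing (for example
/// on non-Linux systems) or cannot be parsed.
pub fn read_system_memory() -> Option<(u64, u64)> {
    let contents = fs::read_to_string("/proc/meminfo").ok()?;
    parse_meminfo(&contents)
}
```

Keeping the parser separate from the file read makes it testable with a fixed string, which matters because DGX Spark hardware may not be available in CI.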
Additional Context
Related Implementations
- NVIDIA Jetson (`nvidia_jetson.rs`): Uses tegrastats and system memory fallback for integrated GPU
- Apple Silicon (`apple.rs`): Unified memory architecture with shared CPU/GPU memory pool
Hardware Access
- Testing requires access to actual DGX Spark hardware or equivalent GB10 system
- Consider adding mock device templates for testing without hardware
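One way to test without hardware is to hide the NVML calls behind a small trait and substitute a mock, sketched below. `GpuProbe`, `MockGpu`, and `memory_line` are hypothetical names, and the mock values are placeholders rather than measurements from a real GB10 system.

```rust
/// Hypothetical abstraction over the queries the reader needs.
pub trait GpuProbe {
    fn name(&self) -> String;
    /// `(used, total)` in bytes, or a reason string when unsupported.
    fn memory_info(&self) -> Result<(u64, u64), String>;
}

/// Mock device template; all values are placeholders chosen by the test.
pub struct MockGpu {
    pub name: &'static str,
    pub memory: Result<(u64, u64), &'static str>,
}

impl GpuProbe for MockGpu {
    fn name(&self) -> String {
        self.name.to_string()
    }

    fn memory_info(&self) -> Result<(u64, u64), String> {
        // `Result` of `Copy` types is itself `Copy`, so this works via `&self`.
        self.memory.map_err(|reason| reason.to_string())
    }
}

/// Format one memory line, degrading gracefully on unsupported queries.
pub fn memory_line(probe: &dyn GpuProbe) -> String {
    match probe.memory_info() {
        Ok((used, total)) => format!("{}: {} / {} bytes", probe.name(), used, total),
        Err(reason) => format!("{}: memory unavailable ({})", probe.name(), reason),
    }
}
```

A test can then build a `MockGpu` whose `memory` is `Err("not supported")` to exercise the unsupported-query path, and one with `Ok((used, total))` to confirm discrete GPU output is unchanged.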