@surajssd surajssd commented Sep 19, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

The DCGM Exporter collects and exposes metrics for NVIDIA GPUs in Kubernetes
clusters. This PR adds the infrastructure needed to automatically install,
configure, and manage the NVIDIA Device Plugin and DCGM services on GPU-enabled
nodes when the EnableManagedGPUExperience tag is set.

  • Added support for 3 core DCGM packages:

    • datacenter-gpu-manager-4-core
    • datacenter-gpu-manager-4-proprietary
    • dcgm-exporter (Azure Linux) / datacenter-gpu-manager-exporter (Ubuntu) -
      Metrics exporter
  • Package Repository Setup: Implemented NVIDIA CUDA repository configuration for
    both Ubuntu and Azure Linux with support for x86_64 and ARM64 architectures

  • Systemd Service Configuration:

    • DCGM Exporter Service: Configures the exporter to run on port 19400
      (avoiding conflicts with user-installed instances on default port 9400)
    • Systemd Integration: Proper service management with custom drop-in
      configurations
    • nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter are managed as a
      cohesive unit
  • Feature Activation

    • Tag-Based Activation: Uses Azure instance metadata to check for
      EnableManagedGPUExperience tag
    • Conditional Installation: DCGM components are installed, activated, and
      enabled via systemd only when explicitly requested through the nodepool tag.
  • AgentBaker e2e Testing & Validation

    • E2E Test Coverage: Added tests for Ubuntu 22.04, 24.04, and Azure Linux 3.0
      as a part of the existing device plugin tests.
    • Package Validation: Ensures correct package versions are installed
    • Service Validation: Verifies systemd services are running correctly
    • Metrics Validation: Confirms exporter is serving metrics on the correct port
      and returning GPU-specific metrics like DCGM_FI_DEV_GPU_UTIL
  • Automated Updates

    • Renovate Integration: Added custom datasources for NVIDIA package
      repositories.
    • Caveat: The Renovate config works correctly for the Ubuntu packages but
      not for the Azure Linux packages, so the Azure Linux versions may need to
      be managed semi-manually for now.
  • Error Handling

    • Added specific error codes for DCGM-related failures
    • Improved package download logic with proper filename handling for special
      characters
  • Performance Optimizations

    • Packages are pre-downloaded during VHD build process and installed during
      the boot process using the CSE scripts.
    • Cleanup logic removes pre-downloaded artifacts after installation or if they
      are not needed.
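
The drop-in configuration mentioned under "Systemd Service Configuration" could look roughly like the sketch below. It assumes the exporter honors the `DCGM_EXPORTER_LISTEN` environment variable (the variable behind dcgm-exporter's `--address` flag) and that the packaged unit does not pass an explicit address on its command line. The function name and the drop-in file name are illustrative and not taken from this PR.

```shell
#!/bin/sh
# Write a systemd drop-in that moves dcgm-exporter off its default port.
# $1: drop-in directory, e.g. /etc/systemd/system/nvidia-dcgm-exporter.service.d
# $2: port to listen on (this PR uses 19400)
write_exporter_port_dropin() {
    local dropin_dir="$1" port="$2"
    mkdir -p "$dropin_dir" || return 1
    # 10-listen-port.conf is an illustrative file name, not the PR's.
    cat > "${dropin_dir}/10-listen-port.conf" <<EOF
[Service]
Environment="DCGM_EXPORTER_LISTEN=:${port}"
EOF
}
```

After writing the file, `systemctl daemon-reload` followed by a restart of `nvidia-dcgm-exporter` would be needed for the new port to take effect.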

Requirements:

"{\"releases\": $map(($index := releases#$i[version=\"Package: {{packageName}}\"].$i; $map($index, function($i) { $substringAfter(releases[$i + 1].version, \"Version: \") })), function($v) { {\"version\": $v} })[]}"
]
},
"nvidia-deb2404": {
Collaborator

I guess we are assuming that once a package is released in the Ubuntu repo it will also be available in AzureLinux?

Member Author

@surajssd surajssd Sep 25, 2025

I am not sure I understand. What do you mean?

Collaborator

My question was more about the fact that the changes in this file are only for Ubuntu. I was wondering whether we need to set up a Renovate datasource for the AzureLinux repos as well.

Member Author

My understanding is that for AzureLinux, Renovate uses the registry URL provided in the components.json file to fetch new updates.

"\"renovateTag\":\\s*\"RPM_registry=(?<registryUrl>[^,]+), name=(?<packageName>[^,]+), os=azurelinux, release=3\\.0\",\\s*\"latestVersion\":\\s*\"(?<currentValue>[^\"]+)\"(?:[^}]*\"previousLatestVersion\":\\s*\"(?<depType>[^\"]+)\")?"
],
"datasourceTemplate": "rpm",
"autoReplaceStringTemplate": "\"renovateTag\": \"RPM_registry={{{registryUrl}}}, name={{{packageName}}}, os=azurelinux, release=3.0\",\n \"latestVersion\": \"{{{newValue}}}\"{{#if depType}},\n \"previousLatestVersion\": \"{{{currentValue}}}\"{{/if}}"

Collaborator

I don't think so, but I might be wrong here. @Devinwong could confirm.

Collaborator

Yes, for RPM_registry it will capture the URL provided in components.json.
But I do have a suggestion here. Since this PR adds multiple new rules and components, it would be good to test them, if you haven't yet, to confirm that Renovate can actually create PRs for them automatically and that the change doesn't break other rules (Renovate will report warnings/errors).

  • Create this branch in your fork and intentionally set the versions in components.json to an older one, e.g. from 4.4.1-1 to 4.4.0, and see if it really works. You will need to onboard your fork to https://developer.mend.io/ so that Renovate can detect it.

Collaborator

I often break Renovate when I introduce new rules, because Renovate relies heavily on renovate.json being correct and the JSON can only be debugged at runtime.

Member Author

@Devinwong the Renovate config for the APT packages is working fine, but the RPM config has proven problematic. I will sync with you offline to move this forward.

@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch 2 times, most recently from 6024f09 to 0068862 Compare September 24, 2025 20:20
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch from 0068862 to 0b76c23 Compare September 24, 2025 20:23
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch from 0b76c23 to 7948be4 Compare September 25, 2025 20:55
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch 2 times, most recently from 47cfb29 to bff227c Compare September 27, 2025 00:17
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch from bff227c to d82b99a Compare September 29, 2025 17:24
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch from d82b99a to 6cbae60 Compare September 29, 2025 18:07
@surajssd surajssd force-pushed the suraj/add-dcgm-exporter branch from 6cbae60 to 4110deb Compare October 1, 2025 16:10
@surajssd surajssd marked this pull request as ready for review October 2, 2025 17:00
@surajssd surajssd requested a review from juan-lee as a code owner October 2, 2025 17:00
Replace direct `apt-get download` with curl approach to handle package names
containing special characters. The `apt-get download` command encodes special
characters in package names but doesn't decode them when saving files to disk,
causing filename issues. This change uses `apt-get --print-uris` to get the
download URL, then uses `curl -fLJO` to download with proper filename handling.

Signed-off-by: Suraj Deshmukh <[email protected]>
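
A minimal sketch of the download approach described in the commit above, assuming `apt-get download --print-uris` prints one line per package in the form `'URL' filename size checksum`. The function names `extract_uri` and `download_deb_with_curl` are illustrative and not taken from the PR.

```shell
#!/bin/sh
# Extract the quoted URL from one line of `apt-get download --print-uris`
# output, which has the shape: 'URL' filename size checksum
extract_uri() {
    local line="$1" q="'"
    line="${line#"$q"}"               # drop the leading single quote
    printf '%s\n' "${line%%"$q"*}"    # keep everything before the closing quote
}

# Hypothetical wrapper: resolve the package URL with apt-get, then let curl
# pick the file name instead of relying on apt-get's on-disk naming.
download_deb_with_curl() {
    local pkg="$1" dest_dir="$2" uri
    uri="$(extract_uri "$(apt-get download --print-uris "$pkg" | head -n 1)")"
    [ -n "$uri" ] || return 1
    (cd "$dest_dir" && curl -fLJO "$uri")
}
```

For example, `download_deb_with_curl datacenter-gpu-manager-4-core /tmp/dcgm` would fetch that package into `/tmp/dcgm`, provided the NVIDIA repository is already configured (the destination path here is only an example).
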
- Add new error code `ERR_NVIDIA_DCGM_INSTALL_FAIL (227)` for DCGM installation
  failures
- Implement `dcgm_package_list()` function to define required DCGM packages
- Add `installNvidiaDCGMPkgFromCache()` for both Ubuntu and Mariner
  distributions
- Integrate DCGM installation into main GPU node setup flow with feature flag
- Update `cleanUpGPUDrivers()` to remove DCGM package directories during cleanup
- Support installation from cached .deb files (Ubuntu) and .rpm files (Mariner)

Signed-off-by: Suraj Deshmukh <[email protected]>
- Replace multiple if statements with a cleaner case statement for package
  testing
- Add support for datacenter-gpu-manager variants in testPkgDownloaded path
- Improve code readability and maintainability

Signed-off-by: Suraj Deshmukh <[email protected]>
- Fix typo: "locatl" -> "local" in updateDnfWithNvidiaPkg function
- Add dnf_makecache call after nvidia repo setup for Mariner
- Add missing benchmark capture for nvidia apt update in Ubuntu

Signed-off-by: Suraj Deshmukh <[email protected]>
Remove datacenter-gpu-manager-4-cuda12 and
datacenter-gpu-manager-4-proprietary-cuda12 packages from the build process.
This includes:

- Removing package definitions from components.json
- Updating package lists in cse_helpers.sh
- Removing download logic from install-dependencies.sh
- Updating VHD content tests

The core and proprietary variants without CUDA 12 suffix remain available.

Signed-off-by: Suraj Deshmukh <[email protected]>
- Add new function `startNvidiaDCGMExporterService()` to configure and start
  NVIDIA DCGM exporter with custom port (19400) to avoid conflicts
- Add `enable_managed_gpu_experience()` helper function to check instance
  metadata for `EnableManagedGPUExperience` tag
- Add new error codes for NVIDIA DCGM service failures and tag lookup errors
- Update main logic to conditionally install and start NVIDIA DCGM services
  based on nodepool tags
- Replace placeholder variable with proper implementation for managed GPU
  extensions

Signed-off-by: Suraj Deshmukh <[email protected]>
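
As a rough illustration of the tag check, the sketch below assumes the tags arrive as the semicolon-separated `name:value` string that the Azure Instance Metadata Service returns in its `tags` field, and that a value of `true` enables the feature. `has_managed_gpu_tag` is a hypothetical name; the PR's `enable_managed_gpu_experience()` may read and parse the metadata differently.

```shell
#!/bin/sh
# Succeeds when the tag string contains EnableManagedGPUExperience:true.
# $1: tags in IMDS form, e.g. "env:prod;EnableManagedGPUExperience:true"
# Matching is case-insensitive for both name and value (an assumption).
has_managed_gpu_tag() {
    local lowered
    lowered="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    # Wrap in ";" so the tag matches only as a whole name:value pair.
    case ";${lowered};" in
        *";enablemanagedgpuexperience:true;"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would fetch the tag string from instance metadata first and then gate the DCGM installation on the function's exit status.
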
- Add test scenarios for Ubuntu 22.04 and 24.04 with NVIDIA DCGM Exporter
- Validate installation of datacenter-gpu-manager packages and exporter
- Add validators to check systemd services (nvidia-dcgm, nvidia-dcgm-exporter)
  are running
- Add validator to verify DCGM exporter metrics endpoint is scrapable on port
  19400
- Configure GPU-enabled VMs with Standard_NC6s_v3 and EnableManagedGPUExperience
  tag

Signed-off-by: Suraj Deshmukh <[email protected]>
Azure Linux package for the DCGM Exporter is called `dcgm-exporter` instead of
`datacenter-gpu-manager-exporter`.

This change updates the codebase to use 'dcgm-exporter' for Azure Linux while
maintaining 'datacenter-gpu-manager-exporter' for Ubuntu.

Changes:
- Make `getDCGMPackageNames()` OS-aware to return appropriate package names
- Move `dcgm_package_list()` function from common helpers to OS-specific scripts
- Update package download logic to handle dcgm-exporter for Azure Linux
- Update component versions to use Microsoft's Azure Linux repository
- Add dcgm-exporter to VHD content tests

Signed-off-by: Suraj Deshmukh <[email protected]>
Split the GPU manager exporter package handling to be OS-specific:
- Use 'datacenter-gpu-manager-exporter' for Ubuntu 22.04 and 24.04
- Use 'dcgm-exporter' for AzureLinux 3.0

This change ensures the correct package name is tested based on the
target operating system and version.

Signed-off-by: Suraj Deshmukh <[email protected]>
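
The OS-specific naming described in the two commits above can be sketched as a small lookup. The package names come from this PR; the function name `dcgm_exporter_package_name` and the OS identifiers passed to it are assumptions made for illustration.

```shell
#!/bin/sh
# Print the DCGM exporter package name for the given OS identifier.
dcgm_exporter_package_name() {
    case "$1" in
        ubuntu)     echo "datacenter-gpu-manager-exporter" ;;
        azurelinux) echo "dcgm-exporter" ;;
        *)          echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}
```

Keeping the mapping in one place means download, install, and test code can ask for the name instead of hardcoding it per OS.
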
Strip epoch from package version (e.g., 1:4.4.1-1 -> 4.4.1-1) for all operating
systems, not just Ubuntu. This ensures consistent package version handling
across different OS types in the VHD content test.

Signed-off-by: Suraj Deshmukh <[email protected]>
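
Epoch stripping as described above amounts to dropping everything up to and including the first colon. The helper below is a sketch with a hypothetical name, not the PR's code.

```shell
#!/bin/sh
# Remove a leading "epoch:" from a package version, if present.
# Versions without an epoch are returned unchanged.
strip_epoch() {
    printf '%s\n' "${1#*:}"
}
```

With this helper, `strip_epoch 1:4.4.1-1` and `strip_epoch 4.4.1-1` both print `4.4.1-1`, so the comparison behaves the same on every OS.
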
- Add length validation when getting expected package versions to prevent index
  out of bounds errors
- Add validation for DCGM_FI_DEV_GPU_UTIL metric to ensure exporter is returning
  actual GPU metrics

Signed-off-by: Suraj Deshmukh <[email protected]>
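
The metric validation can be approximated from a shell with the sketch below, which reads Prometheus text-format output on stdin. `has_metric` is a hypothetical helper; the e2e validator in this PR may check the endpoint differently.

```shell
#!/bin/sh
# Succeeds if stdin (Prometheus text format) has at least one sample for $1.
has_metric() {
    local metric="$1"
    # Sample lines start with the metric name followed by "{" (labels) or a
    # space; "# HELP" and "# TYPE" comment lines do not match.
    grep -Eq "^${metric}(\{| )"
}
```

On a node, this could be fed from the exporter, for example `curl -fsS http://localhost:19400/metrics | has_metric DCGM_FI_DEV_GPU_UTIL`, assuming the exporter serves its metrics at the usual `/metrics` path.
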
Add the dcgm-exporter component with a download location that matches the
package name for Azure Linux v3.0. This avoids writing custom code to handle a
folder name that differs from the package name.

Signed-off-by: Suraj Deshmukh <[email protected]>
When querying JSON paths with gjson, dots in the release version string (e.g.,
"v3.0") are interpreted as path separators rather than literal characters. This
causes incorrect path resolution when looking up package versions.

- Escape dots in release string before using in JSON path queries
- Update Azure Linux 3 test to use proper version "v3.0" in both device plugin
  and DCGM tests.
- Improve error message to include OS version for better debugging

Signed-off-by: Suraj Deshmukh <[email protected]>
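
gjson treats an unescaped `.` as a path separator and accepts `\.` for a literal dot. The e2e tests are written in Go; the shell sketch below only illustrates the string transform described above, and `escape_dots` is a hypothetical helper name.

```shell
#!/bin/sh
# Escape every "." so the string can be used as a single gjson path segment.
escape_dots() {
    printf '%s\n' "$1" | sed 's/\./\\./g'
}
```

For the release string `v3.0` this prints `v3\.0`, which gjson then reads as one key rather than as `v3` followed by `0`.
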
Restrict NVIDIA DCGM package installation to Azure Linux 3.0 only by
adding an OS version check in `installNvidiaDCGMPkgFromCache()`. This
ensures compatibility and prevents installation attempts on unsupported
versions.

Signed-off-by: Suraj Deshmukh <[email protected]>
Consolidate GPU device plugin and DCGM services under managed GPU experience.

- Moving nvidia-device-plugin installation and configuration from general GPU
  setup to managed GPU experience
- Renaming functions to reflect broader scope (startNvidiaDCGMExporterService ->
  startNvidiaManagedExpServices)
- Adding package installation skip logic for already installed packages
- Updating error constants and logging to be more generic
- Adding EnableManagedGPUExperience tag to all GPU test scenarios
- Consolidating package management functions across Ubuntu and AzureLinux

The managed GPU experience now handles nvidia-device-plugin, nvidia-dcgm, and
nvidia-dcgm-exporter as a cohesive unit, providing better control and
consistency for GPU workloads.

Signed-off-by: Suraj Deshmukh <[email protected]>
Rename NVIDIA DCGM Exporter test functions to use consistent underscore
naming convention between OS version and test description:
- `Test_Ubuntu2404NvidiaDCGMExporterRunning` ->
  `Test_Ubuntu2404_NvidiaDCGMExporterRunning`
- `Test_Ubuntu2204NvidiaDCGMExporterRunning` ->
  `Test_Ubuntu2204_NvidiaDCGMExporterRunning`
- `Test_AzureLinux3NvidiaDCGMExporterRunning` ->
  `Test_AzureLinux3_NvidiaDCGMExporterRunning`

Signed-off-by: Suraj Deshmukh <[email protected]>
- Rename `enable_managed_gpu_experience()` to `enableManagedGPUExperience()`
- Rename `is_package_installed()` to `isPackageInstalled()` in both Mariner and
  Ubuntu scripts
- Update all function calls to use the new camelCase naming convention

This change improves code consistency by adopting a uniform naming convention
across the codebase.

Signed-off-by: Suraj Deshmukh <[email protected]>
- Move getDCGMPackageNames function from scenario_test.go to
  scenario_nvidia_device_plugin_test.go
- Remove separate DCGM Exporter test functions for Ubuntu 24.04, Ubuntu 22.04,
  and Azure Linux v3
- Integrate DCGM package validation and service checks into existing NVIDIA
  device plugin tests
- Update test descriptions to reflect combined validation of both device plugin
  and DCGM Exporter
- Refactor hardcoded OS/version strings to use variables for better
  maintainability

This consolidation reduces test duplication and provides comprehensive GPU
functionality validation in a single test per OS.

Signed-off-by: Suraj Deshmukh <[email protected]>
Rename scenario_nvidia_device_plugin_test.go to
scenario_gpu_managed_experience_test.go to better represent the comprehensive
GPU testing scenarios covered in the file.

Signed-off-by: Suraj Deshmukh <[email protected]>
- Restrict regex matching to specific DCGM packages
  (datacenter-gpu-manager-4-core, datacenter-gpu-manager-4-proprietary,
  datacenter-gpu-manager-exporter)
- Change repository from 'production' to 'nvidia' for both Ubuntu 22.04 and
  24.04 configurations
- Update descriptions to reflect DCGM-specific scope

Signed-off-by: Suraj Deshmukh <[email protected]>
Ignore AI-related configuration files to prevent them from being tracked in
version control.

Signed-off-by: Suraj Deshmukh <[email protected]>
Convert function name from snake_case to camelCase for consistency
with coding style conventions. Update all function calls in both
Mariner and Ubuntu CSE installation scripts.

Signed-off-by: Suraj Deshmukh <[email protected]>
Signed-off-by: Suraj Deshmukh <[email protected]>
@surajssd surajssd merged commit 3ffd233 into master Oct 10, 2025
41 of 42 checks passed
@surajssd surajssd deleted the suraj/add-dcgm-exporter branch October 10, 2025 18:57
Labels

components This pull request updates cached components on Linux or Windows VHDs

6 participants