Add NVIDIA DCGM packages and repository support #7063
"{\"releases\": $map(($index := releases#$i[version=\"Package: {{packageName}}\"].$i; $map($index, function($i) { $substringAfter(releases[$i + 1].version, \"Version: \") })), function($v) { {\"version\": $v} })[]}"
]
},
"nvidia-deb2404": {
I guess we are assuming that once it is released in the Ubuntu repo, it will be available in AzureLinux?
I am not sure I understand. What do you mean?
I guess my question was more about the fact that the changes in this file were only for Ubuntu. I was wondering if we needed to set up a Renovate datasource for the AzureLinux repos.
My understanding is that for AzureLinux, Renovate uses the registry URL provided in the components.json file to fetch new updates.
AgentBaker/.github/renovate.json
Lines 625 to 628 in b805330
"\"renovateTag\":\\s*\"RPM_registry=(?<registryUrl>[^,]+), name=(?<packageName>[^,]+), os=azurelinux, release=3\\.0\",\\s*\"latestVersion\":\\s*\"(?<currentValue>[^\"]+)\"(?:[^}]*\"previousLatestVersion\":\\s*\"(?<depType>[^\"]+)\")?"
],
"datasourceTemplate": "rpm",
"autoReplaceStringTemplate": "\"renovateTag\": \"RPM_registry={{{registryUrl}}}, name={{{packageName}}}, os=azurelinux, release=3.0\",\n \"latestVersion\": \"{{{newValue}}}\"{{#if depType}},\n \"previousLatestVersion\": \"{{{currentValue}}}\"{{/if}}"
I don't think so, but I might be wrong here. @Devinwong could confirm.
Yes, for RPM_registry, it will capture the URL provided in components.json.
But I do have a suggestion here. As this PR adds multiple new rules and components, it would be great to test them out, if you haven't yet, to see whether Renovate can really create PRs for them automatically and to ensure it doesn't break others (Renovate will complain with warnings/errors).
- Create this branch in your fork repo and intentionally set the versions in components.json to an older one, e.g. from `4.4.1-1` to `4.4.0`, and see if it really works. You will need to onboard your fork to https://developer.mend.io/ so that Renovate can detect your fork.
I often break Renovate when I introduce new rules, as Renovate relies heavily on renovate.json being correct and the JSON can only be debugged at runtime.
@Devinwong the Renovate config for the APT packages is working fine, but the RPM config has proven problematic. I will sync with you offline to move this forward.
Replace direct `apt-get download` with a curl approach to handle package names containing special characters. The `apt-get download` command encodes special characters in package names but doesn't decode them when saving files to disk, causing filename issues. This change uses `apt-get --print-uris` to get the download URL, then uses `curl -fLJO` to download with proper filename handling.
Signed-off-by: Suraj Deshmukh <[email protected]>
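The two-step flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names `extract_pkg_url` and `download_pkg_with_curl` are hypothetical, and it assumes each `--print-uris` output line has the form `'URL' filename size hash`.

```shell
# Extract the first single-quoted URL from `apt-get download --print-uris`
# output read on stdin. Assumed line format: 'URL' filename size hash
extract_pkg_url() {
    sed -n "s/^'\([^']*\)'.*/\1/p" | head -n 1
}

# Hypothetical wrapper: resolve the URL with apt-get, then fetch it with curl
# into the given directory instead of letting apt-get write the file itself.
download_pkg_with_curl() {
    local pkg="$1" dest_dir="$2" url
    url="$(apt-get download --print-uris "${pkg}" | extract_pkg_url)"
    [ -n "${url}" ] || return 1
    (cd "${dest_dir}" && curl -fLJO "${url}")
}
```

Keeping the URL extraction separate from the download makes the parsing step easy to check against sample output without network access.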
- Add new error code `ERR_NVIDIA_DCGM_INSTALL_FAIL (227)` for DCGM installation failures
- Implement `dcgm_package_list()` function to define required DCGM packages
- Add `installNvidiaDCGMPkgFromCache()` for both Ubuntu and Mariner distributions
- Integrate DCGM installation into main GPU node setup flow with feature flag
- Update `cleanUpGPUDrivers()` to remove DCGM package directories during cleanup
- Support installation from cached .deb files (Ubuntu) and .rpm files (Mariner)
Signed-off-by: Suraj Deshmukh <[email protected]>
- Replace multiple if statements with a cleaner case statement for package testing
- Add support for datacenter-gpu-manager variants in testPkgDownloaded path
- Improve code readability and maintainability
Signed-off-by: Suraj Deshmukh <[email protected]>
- Fix typo: "locatl" -> "local" in updateDnfWithNvidiaPkg function
- Add dnf_makecache call after nvidia repo setup for Mariner
- Add missing benchmark capture for nvidia apt update in Ubuntu
Signed-off-by: Suraj Deshmukh <[email protected]>
Remove datacenter-gpu-manager-4-cuda12 and datacenter-gpu-manager-4-proprietary-cuda12 packages from the build process. This includes:
- Removing package definitions from components.json
- Updating package lists in cse_helpers.sh
- Removing download logic from install-dependencies.sh
- Updating VHD content tests
The core and proprietary variants without CUDA 12 suffix remain available.
Signed-off-by: Suraj Deshmukh <[email protected]>
- Add new function `startNvidiaDCGMExporterService()` to configure and start NVIDIA DCGM exporter with custom port (19400) to avoid conflicts
- Add `enable_managed_gpu_experience()` helper function to check instance metadata for `EnableManagedGPUExperience` tag
- Add new error codes for NVIDIA DCGM service failures and tag lookup errors
- Update main logic to conditionally install and start NVIDIA DCGM services based on nodepool tags
- Replace placeholder variable with proper implementation for managed GPU extensions
Signed-off-by: Suraj Deshmukh <[email protected]>
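For readers unfamiliar with tag lookups through instance metadata, here is a rough sketch of what such a check could look like. It is not the PR's implementation: the function names are hypothetical, and it assumes the Azure IMDS `compute/tags` text endpoint (semicolon-separated `key:value` pairs) and that the feature is enabled when the tag value is `true`.

```shell
# Assumption: fetch the VM tags from Azure IMDS as one "k1:v1;k2:v2" string.
fetch_imds_tags() {
    curl -fsS -H "Metadata: true" --max-time 5 \
        "http://169.254.169.254/metadata/instance/compute/tags?api-version=2021-02-01&format=text"
}

# Return 0 when the tag string contains EnableManagedGPUExperience:true.
# Both the tag name and its value are compared case-insensitively.
has_managed_gpu_tag() {
    local tags_lower
    tags_lower="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    case ";${tags_lower};" in
        *";enablemanagedgpuexperience:true;"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would combine the two, for example `has_managed_gpu_tag "$(fetch_imds_tags)"`, and treat a failed metadata request as a separate error case.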
- Add test scenarios for Ubuntu 22.04 and 24.04 with NVIDIA DCGM Exporter
- Validate installation of datacenter-gpu-manager packages and exporter
- Add validators to check systemd services (nvidia-dcgm, nvidia-dcgm-exporter) are running
- Add validator to verify DCGM exporter metrics endpoint is scrapable on port 19400
- Configure GPU-enabled VMs with Standard_NC6s_v3 and EnableManagedGPUExperience tag
Signed-off-by: Suraj Deshmukh <[email protected]>
The Azure Linux package for the DCGM Exporter is called `dcgm-exporter` instead of `datacenter-gpu-manager-exporter`. This change updates the codebase to use 'dcgm-exporter' for Azure Linux while maintaining 'datacenter-gpu-manager-exporter' for Ubuntu. Changes:
- Make `getDCGMPackageNames()` OS-aware to return appropriate package names
- Move `dcgm_package_list()` function from common helpers to OS-specific scripts
- Update package download logic to handle dcgm-exporter for Azure Linux
- Update component versions to use Microsoft's Azure Linux repository
- Add dcgm-exporter to VHD content tests
Signed-off-by: Suraj Deshmukh <[email protected]>
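To illustrate the OS-specific naming, a single helper could select the exporter package by OS as sketched below. The PR itself moves `dcgm_package_list()` into per-OS scripts, so this combined function (`dcgm_packages_for_os`, a hypothetical name) is only a compact way to show the difference. It assumes the core and proprietary package names are the same on both distributions.

```shell
# Hypothetical helper: print the DCGM package names for an OS, one per line.
# Only the exporter package name differs between Ubuntu and Azure Linux.
dcgm_packages_for_os() {
    local exporter
    case "$1" in
        ubuntu)     exporter="datacenter-gpu-manager-exporter" ;;
        azurelinux) exporter="dcgm-exporter" ;;
        *) echo "unsupported OS: $1" >&2; return 1 ;;
    esac
    printf '%s\n' \
        "datacenter-gpu-manager-4-core" \
        "datacenter-gpu-manager-4-proprietary" \
        "${exporter}"
}
```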
Split the GPU manager exporter package handling to be OS-specific:
- Use 'datacenter-gpu-manager-exporter' for Ubuntu 22.04 and 24.04
- Use 'dcgm-exporter' for AzureLinux 3.0
This change ensures the correct package name is tested based on the target operating system and version.
Signed-off-by: Suraj Deshmukh <[email protected]>
Strip epoch from package version (e.g., 1:4.4.1-1 -> 4.4.1-1) for all operating systems, not just Ubuntu. This ensures consistent package version handling across different OS types in the VHD content test.
Signed-off-by: Suraj Deshmukh <[email protected]>
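As a small illustration of the epoch stripping described here, the sketch below removes everything up to and including the first colon. The helper name `strip_epoch` is hypothetical and the actual test code may do this differently.

```shell
# Drop a leading "EPOCH:" prefix from a package version string.
# Versions without a colon are printed unchanged.
strip_epoch() {
    printf '%s\n' "${1#*:}"
}
```

For example, `strip_epoch 1:4.4.1-1` prints `4.4.1-1`, and `strip_epoch 4.4.1-1` prints the same value back.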
- Add length validation when getting expected package versions to prevent index out of bounds errors
- Add validation for DCGM_FI_DEV_GPU_UTIL metric to ensure exporter is returning actual GPU metrics
Signed-off-by: Suraj Deshmukh <[email protected]>
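A metric check of this kind can be approximated from the command line as sketched below. This is not the e2e validator from the PR: `has_gpu_util_metric` is a hypothetical helper, and the `/metrics` path on port 19400 is an assumption based on the usual Prometheus exporter convention.

```shell
# Return 0 when the Prometheus-format text on stdin contains at least one
# DCGM_FI_DEV_GPU_UTIL sample line. "# HELP" and "# TYPE" comment lines do
# not count, because they do not start with the metric name.
has_gpu_util_metric() {
    grep -Eq '^DCGM_FI_DEV_GPU_UTIL(\{| )'
}

# Possible usage on a GPU node (assumed endpoint):
#   curl -fsS http://localhost:19400/metrics | has_gpu_util_metric
```

Matching only sample lines matters because an exporter could expose the metric's help text without reporting any value for it.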
Add dcgm-exporter component with download location matching the package name for Azure Linux v3.0. This ensures that we don't have to write any custom code for a folder name that does not match the package name.
Signed-off-by: Suraj Deshmukh <[email protected]>
When querying JSON paths with gjson, dots in the release version string (e.g., "v3.0") are interpreted as path separators rather than literal characters. This causes incorrect path resolution when looking up package versions.
- Escape dots in release string before using in JSON path queries
- Update Azure Linux 3 test to use proper version "v3.0" in both device plugin and DCGM tests
- Improve error message to include OS version for better debugging
Signed-off-by: Suraj Deshmukh <[email protected]>
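The escaping step can be shown in isolation. The fix in this commit lives in the Go test code that calls gjson; the shell helper below (`escape_dots`, a hypothetical name) only demonstrates the string transformation, in which each dot is prefixed with a backslash so that a path parser treats it as a literal character.

```shell
# Prefix every "." with a backslash so it is read as a literal dot
# rather than as a path separator.
escape_dots() {
    printf '%s\n' "$1" | sed 's/\./\\./g'
}
```

For example, `escape_dots v3.0` prints `v3\.0`.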
Restrict NVIDIA DCGM package installation to only Azure Linux 3.0 by adding an OS version check in `installNvidiaDCGMPkgFromCache()`. This ensures compatibility and prevents installation attempts on unsupported versions.
Signed-off-by: Suraj Deshmukh <[email protected]>
Consolidate GPU device plugin and DCGM services under managed GPU experience.
- Moving nvidia-device-plugin installation and configuration from general GPU setup to managed GPU experience
- Renaming functions to reflect broader scope (startNvidiaDCGMExporterService -> startNvidiaManagedExpServices)
- Adding package installation skip logic for already installed packages
- Updating error constants and logging to be more generic
- Adding EnableManagedGPUExperience tag to all GPU test scenarios
- Consolidating package management functions across Ubuntu and AzureLinux
The managed GPU experience now handles nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter as a cohesive unit, providing better control and consistency for GPU workloads.
Signed-off-by: Suraj Deshmukh <[email protected]>
Rename NVIDIA DCGM Exporter test functions to use consistent underscore naming convention between OS version and test description:
- `Test_Ubuntu2404NvidiaDCGMExporterRunning` -> `Test_Ubuntu2404_NvidiaDCGMExporterRunning`
- `Test_Ubuntu2204NvidiaDCGMExporterRunning` -> `Test_Ubuntu2204_NvidiaDCGMExporterRunning`
- `Test_AzureLinux3NvidiaDCGMExporterRunning` -> `Test_AzureLinux3_NvidiaDCGMExporterRunning`
Signed-off-by: Suraj Deshmukh <[email protected]>
- Rename `enable_managed_gpu_experience()` to `enableManagedGPUExperience()`
- Rename the snake_case package-installed check to `isPackageInstalled()` in both Mariner and Ubuntu scripts
- Update all function calls to use the new camelCase naming convention
This change improves code consistency by adopting a uniform naming convention across the codebase.
Signed-off-by: Suraj Deshmukh <[email protected]>
- Move getDCGMPackageNames function from scenario_test.go to scenario_nvidia_device_plugin_test.go
- Remove separate DCGM Exporter test functions for Ubuntu 24.04, Ubuntu 22.04, and Azure Linux v3
- Integrate DCGM package validation and service checks into existing NVIDIA device plugin tests
- Update test descriptions to reflect combined validation of both device plugin and DCGM Exporter
- Refactor hardcoded OS/version strings to use variables for better maintainability
This consolidation reduces test duplication and provides comprehensive GPU functionality validation in a single test per OS.
Signed-off-by: Suraj Deshmukh <[email protected]>
Rename scenario_nvidia_device_plugin_test.go to scenario_gpu_managed_experience_test.go to better represent the comprehensive GPU testing scenarios covered in the file.
Signed-off-by: Suraj Deshmukh <[email protected]>
- Restrict regex matching to specific DCGM packages (datacenter-gpu-manager-4-core, datacenter-gpu-manager-4-proprietary, datacenter-gpu-manager-exporter)
- Change repository from 'production' to 'nvidia' for both Ubuntu 22.04 and 24.04 configurations
- Update descriptions to reflect DCGM-specific scope
Signed-off-by: Suraj Deshmukh <[email protected]>
Ignore AI-related configuration files to prevent them from being tracked in version control.
Signed-off-by: Suraj Deshmukh <[email protected]>
Convert function name from snake_case to camelCase for consistency with coding style conventions. Update all function calls in both Mariner and Ubuntu CSE installation scripts.
Signed-off-by: Suraj Deshmukh <[email protected]>
Signed-off-by: Suraj Deshmukh <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
The DCGM Exporter provides GPU metrics collection and monitoring capabilities
for NVIDIA GPUs in Kubernetes clusters. This implementation adds the necessary
infrastructure to automatically install, configure, and manage the NVIDIA Device
Plugin and DCGM services on GPU-enabled nodes when the
`EnableManagedGPUExperience` tag is set.
Added support for 3 core DCGM packages:
- `datacenter-gpu-manager-4-core`
- `datacenter-gpu-manager-4-proprietary`
- `dcgm-exporter` (Azure Linux) / `datacenter-gpu-manager-exporter` (Ubuntu): metrics exporter
Package Repository Setup: Implemented NVIDIA CUDA repository configuration for
both Ubuntu and Azure Linux with support for x86_64 and ARM64 architectures
Systemd Service Configuration:
`19400` (avoiding conflicts with user-installed instances on default port `9400`)
configurations
cohesive unit
Feature Activation
`EnableManagedGPUExperience` tag
via systemd when explicitly enabled via nodepool tag.
AgentBaker e2e Testing & Validation
as a part of the existing device plugin tests.
and returning GPU-specific metrics like `DCGM_FI_DEV_GPU_UTIL`
Automated Updates
repositories.
not for the Azure Linux packages. So we may need to manage these
semi-manually for now
Error Handling
characters
Performance Optimizations
the boot process using the CSE scripts.
are not needed.
Requirements: