Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jul 1, 2025

This patch adds extra functionality to the apply-exit hook, calling the nvidia-ctk cdi generate to refresh the CDI config on a MIG (re)configuration event

@ArangoGutierrez ArangoGutierrez self-assigned this Jul 1, 2025

This comment was marked as outdated.

@ArangoGutierrez ArangoGutierrez requested a review from klueska July 1, 2025 11:49
fi
}

function refresh-cdi-config() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer required.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ArangoGutierrez ArangoGutierrez force-pushed the cdi_generate_hook branch 2 times, most recently from cf9c0bd to 83b4d05 Compare July 1, 2025 15:08
@ArangoGutierrez ArangoGutierrez requested a review from elezar July 1, 2025 15:09
@ArangoGutierrez ArangoGutierrez changed the title Add refresh-cdi hook Extend the functionality of apply-exit to refresh CDI config Jul 2, 2025
@ArangoGutierrez
Copy link
Collaborator Author

PR Edited
@elezar / @klueska PTAL

This comment was marked as outdated.

@elezar
Copy link
Member

elezar commented Jul 10, 2025

@ArangoGutierrez how are we specifying WHERE the output should be created? Although @klueska had concerns about triggering the nvidia-cdi-refresh.service doing so would mean that we would ensure consistency in terms of the ouptut path.

Also, on:

Given that NVIDIA/nvidia-container-toolkit#1107 is merged, if I remove the --output from this PR that should be enough to allow the user to specify custom configuration via the config file

I don't think that specifying an output path in the config file makes sense since this would complicate the default behvaiour which is to output the spec to STDOUT.

@ArangoGutierrez
Copy link
Collaborator Author

@ArangoGutierrez how are we specifying WHERE the output should be created? Although @klueska had concerns about triggering the nvidia-cdi-refresh.service doing so would mean that we would ensure consistency in terms of the ouptut path.

I would like to call the nvidia-cdi-refresh.service from the hook, maybe something to discuss with @klueska

@ArangoGutierrez
Copy link
Collaborator Author

@klueska / @elezar Updated based on today's discussion

This comment was marked as outdated.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR extends the apply-exit hook functionality to refresh CDI (Container Device Interface) configuration after MIG reconfiguration events. The implementation adds a new hook that triggers CDI configuration updates through a systemd service.

Key changes:

  • Adds a new cdi-refresh hook to the hooks configuration
  • Creates a new script that conditionally starts the nvidia-cdi-refresh systemd service
  • Integrates CDI refresh functionality into the MIG manager's hook system

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
deployments/systemd/hooks-default.yaml Adds new cdi-refresh hook configuration that executes the CDI refresh script
deployments/systemd/cdi-refresh.sh New shell script that conditionally starts nvidia-cdi-refresh.service if available

@ArangoGutierrez
Copy link
Collaborator Author

Refresher PING @klueska / @elezar

@ArangoGutierrez
Copy link
Collaborator Author

Now that we have Toolkit RC6 and are almost ready for proper release, and Systemd is stable, can we give this PR another try?
@klueska / @elezar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants