Alexey-Rivkin
Contributor

What?

  • Accept nvidia_peermem alongside nv_peer_mem in GPU peer-memory checks in buildlib/az-helpers.sh.
  • Remove checking the legacy nv_peer_mem systemd service.
  • Keep try_load_cuda_env() verifying via /sys/kernel/mm/memory_peers/nv_mem/version.

Why?

  • Hosts are migrating to nvidia_peermem. Accepting either module avoids false negatives on migrated hosts and removes the dependency on a legacy service, while preserving backward compatibility.

How?

  • check_nv_peer_mem(): pass if either nvidia_peermem or nv_peer_mem is loaded; drop service status check.
  • try_load_cuda_env(): unchanged behavior, still relies on the sysfs version file.
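
As a rough illustration of the two checks described above, here is a minimal sketch. It is not the actual code in buildlib/az-helpers.sh: the function names `peer_mem_module_loaded` and `peer_mem_version_file_present` are hypothetical, and the matcher compares exact module names, whereas the diff in this PR uses a single grep pattern (`nv.*_peer.*mem`). The module list is read from stdin and the sysfs path is an overridable argument so that both pieces can be exercised without a GPU host.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names below are illustrative, not taken from az-helpers.sh.

# Succeed if lsmod-style output on stdin lists either peer-memory module.
# lsmod prints the module name in the first column.
peer_mem_module_loaded() {
    awk '$1 == "nvidia_peermem" || $1 == "nv_peer_mem" { found = 1 }
         END { exit !found }'
}

# Succeed if the peer-memory sysfs version file is readable.
# The default path is the one named in this PR; the argument exists only for testing.
peer_mem_version_file_present() {
    local version_file="${1:-/sys/kernel/mm/memory_peers/nv_mem/version}"
    [ -r "$version_file" ]
}
```

On a real host the first function would be fed live output, for example `lsmod | peer_mem_module_loaded`. No systemd service is queried in either check.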

- Pass GPU peer-memory check if either nvidia_peermem or nv_peer_mem is loaded
- Remove checking for old nv_peer_mem systemd service
- Keep try_load_cuda_env() behavior: verify via /sys/kernel/mm/memory_peers/nv_mem/version
- Remove redundant lsmod grep and stop referencing nv_peer_mem systemd service
- Support hosts migrating to nvidia_peermem while preserving backward compatibility

Signed-off-by: Alexey Rivkin <[email protected]>
AZP: accept nvidia_peermem and drop legacy service check

Signed-off-by: Alexey Rivkin <[email protected]>
if ! lsmod | grep -q 'nv.*_peer.*mem'; then
lsmod | grep 'nv.*_peer.*mem'
systemctl status nv_peer_mem
azure_log_error "nv_peer_mem module not loaded on $(hostname -s)"

Can you change the message to say nv_peer_mem|nvidia_peermem?

Contributor Author

Maybe just "NV peer memory module not loaded on $(hostname -s)"?

sure
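
For reference, the driver-neutral wording settled on in this thread could be emitted along these lines. This is only a sketch: `log_error_sketch` is a hypothetical stand-in for `azure_log_error`, whose definition is not shown here, and `report_peer_mem_missing` is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for azure_log_error; the real helper is not shown in this thread.
log_error_sketch() {
    echo "ERROR: $*" >&2
}

# Report a missing peer-memory module without naming a specific driver.
# The host defaults to the short hostname; the argument exists only for testing.
report_peer_mem_missing() {
    local host="${1:-$(hostname -s)}"
    log_error_sketch "NV peer memory module not loaded on ${host}"
    return 1
}
```

Keeping the message generic means it stays accurate whichever of nv_peer_mem or nvidia_peermem a host is expected to load.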
