Resolve observed failure of NCCL plugin installation #219

Merged


tpdownes
Member

We observe failures of NCCL plugin installation when the default Docker network profile is used, because it does not bind to a real interface that can route to the instance metadata server. As a result, machine-type verification fails in some cases.

An additional commit applies shfmt formatting rules throughout the scripts.
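For readers unfamiliar with the check involved, here is a minimal sketch of what a machine-type lookup against the metadata server can look like. The function names `machine_type_from_path` and `query_machine_type` are hypothetical and are not taken from this PR's scripts. The metadata URL and the `Metadata-Flavor: Google` header are the standard Compute Engine ones.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not from this PR.

# The metadata server returns a path whose last component is the machine
# type, e.g. ".../machineTypes/a3-highgpu-8g". Keep only that component.
machine_type_from_path() {
  printf '%s\n' "${1##*/}"
}

# Ask the Compute Engine metadata server for this VM's machine type.
# Returns non-zero if the server cannot be reached within 5 seconds,
# which is the kind of failure described above.
query_machine_type() {
  local path
  path=$(curl --silent --fail --max-time 5 \
    --header "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/machine-type") || return 1
  machine_type_from_path "$path"
}
```

Calling `query_machine_type` from inside a container is a quick way to see whether that container can reach the metadata server. One common way to give a container the host's interfaces is `docker run --network=host`. That option is mentioned here as an assumption, since the PR description does not state which network setting the fix uses.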

@tpdownes
Member Author

This was manually tested on the default a3-highgpu-8g blueprint in https://github.com/GoogleCloudPlatform/cluster-toolkit/ and observed to work.

Before the change:

$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
0: d20b62ba38cd140c54a16d46982a43ef  /var/lib/tcpx/lib64/libnccl-net.so
1: d20b62ba38cd140c54a16d46982a43ef  /var/lib/tcpx/lib64/libnccl-net.so

After the change:

$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
1: 293526e53c204f583903a51fde9aed58  /var/lib/tcpx/lib64/libnccl-net.so
0: 293526e53c204f583903a51fde9aed58  /var/lib/tcpx/lib64/libnccl-net.so
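The comparison above can be scripted. This is a rough sketch, assuming the post-change checksum is the one expected for the correctly installed plugin, which the PR implies but does not state outright. The helper name `all_nodes_report` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: `all_nodes_report` is a hypothetical helper, not part of this PR.

# Reads `srun --label md5sum <file>` output on stdin, where each line looks
# like "<rank>: <md5>  <path>", and succeeds only if at least one line was
# read and every line carries the expected checksum given as $1.
all_nodes_report() {
  awk -v want="$1" '
    { count++; if ($2 != want) bad++ }
    END { exit (count > 0 && bad == 0 ? 0 : 1) }'
}
```

For example, `srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so | all_nodes_report 293526e53c204f583903a51fde9aed58` would exit with status 0 on the "after" output and non-zero on the "before" output.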

@mr0re1 mr0re1 removed their assignment Sep 30, 2024
Collaborator

@samskillman samskillman left a comment


Tested and resolves the observed failure.

@tpdownes tpdownes merged commit b400646 into GoogleCloudPlatform:master Sep 30, 2024
1 of 2 checks passed
@tpdownes tpdownes deleted the fix_nccl_plugin_installation branch September 30, 2024 22:33