In Google Batch, install GPU drivers for GPU VMs #5406

Merged: 7 commits into nextflow-io:master on Feb 13, 2025

Conversation

siddharthab
Contributor

Also clean up some old logic for container options when using GPUs.
These are now automatically handled by Google Cloud.

Fixes #5372.

Signed-off-by: Siddhartha Bagaria [email protected]

@siddharthab siddharthab marked this pull request as ready for review October 17, 2024 00:28
Comment on lines -227 to +228
final accel = task.config.getAccelerator()
// add nvidia specific driver paths
// see https://cloud.google.com/batch/docs/create-run-job#create-job-gpu
if( accel && accel.type.toLowerCase().startsWith('nvidia-') ) {
    container
        .addVolumes('/var/lib/nvidia/lib64:/usr/local/nvidia/lib64')
        .addVolumes('/var/lib/nvidia/bin:/usr/local/nvidia/bin')
}

def containerOptions = task.config.getContainerOptions() ?: ''
// accelerator requires privileged option
// https://cloud.google.com/batch/docs/create-run-job#create-job-gpu
if( task.config.getAccelerator() || fusionEnabled() ) {
if( fusionEnabled() ) {
Member

Why is this logic removed?

Contributor Author

It does not seem to be needed anymore, since Google does this automatically if the installGpuDrivers flag is set. It is not mentioned in the official Google documentation either.

The only caveat is that people might still need to add /usr/local/nvidia/bin to their PATH and /usr/local/nvidia/lib64 to their LD_LIBRARY_PATH, depending on their app (some apps look there automatically anyway), but that is not happening right now either.

Member

Indeed, the library path should be handled by the container.
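
For readers whose container image does not already set these variables, a minimal sketch of a workaround is to export them at the top of the task script. The process name `gpuTask`, the accelerator type `nvidia-tesla-t4`, and the `nvidia-smi` command are illustrative assumptions, not part of this PR; only the two `/usr/local/nvidia` paths come from the discussion above.

```groovy
// Hypothetical process: the name, accelerator type, and command are illustrative only.
process gpuTask {
    accelerator 1, type: 'nvidia-tesla-t4'

    script:
    """
    # Make the driver binaries and libraries visible to the tool being run.
    export PATH="/usr/local/nvidia/bin:\$PATH"
    export LD_LIBRARY_PATH="/usr/local/nvidia/lib64:\${LD_LIBRARY_PATH:-}"
    nvidia-smi
    """
}
```

Setting the same two variables in the container image itself would keep pipeline scripts free of platform-specific paths, which appears to be the preference expressed in this thread.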

Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Siddhartha Bagaria <[email protected]>
@bentsherman
Member

@siddharthab overall the changes make sense. I think we'll need to do some testing on our end to make sure the machine type selection still works with Fusion + SSD

Can you confirm that it works for you with some GPU-enabled tasks?
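
As a rough illustration of what a GPU-enabled test setup on Google Batch might look like, here is a hedged `nextflow.config` sketch. The accelerator type, the machine type, and the map form of `accelerator` in configuration are assumptions made for this example; they are not taken from this PR.

```groovy
// nextflow.config (sketch under the assumptions stated above)
process {
    executor = 'google-batch'

    // Option A: request a GPU through the accelerator directive
    // (map form assumed; in a process body this would be `accelerator 1, type: '...'`).
    accelerator = [request: 1, type: 'nvidia-tesla-t4']

    // Option B: select a GPU machine type directly instead of Option A.
    // machineType = 'a2-highgpu-1g'
}

google {
    batch {
        // Boot disk image for the worker VM, e.g. 'batch-cos' or 'batch-debian'.
        bootDiskImage = 'batch-cos'
    }
}

// Toggle these to exercise the Fusion code path, which requires Wave.
wave.enabled = true
fusion.enabled = true
```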

Signed-off-by: Siddhartha Bagaria <[email protected]>
@siddharthab
Contributor Author

Yes, I confirm that I tested the following configurations:

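| GPU Directive | bootDiskImage | Fusion | Result |
|---------------|---------------|--------|--------|
| machineType   | batch-cos     | No     | PASS   |
| accelerator   | batch-cos     | No     | PASS   |
| machineType   | batch-cos     | Yes    | PASS†  |
| accelerator   | batch-cos     | Yes    | PASS   |
| machineType   | batch-debian  | No     | FAIL‡  |
| accelerator   | batch-debian  | Yes    | FAIL‡  |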

† Needed new logic in findValidLocalSSDSize for the larger machine types, now pushed as an update to this PR.

‡ It looks like installGpuDrivers works properly only when bootDiskImage is batch-cos (the default). With batch-debian, the installer aborted with the following in journalctl. This is independent of any changes made in this PR (mentioning here because it came up in my testing).

google_metadata_script_runner[846]: startup-script-url: [2024-10-09 19:40:44] Executing: lspci -n
google_metadata_script_runner[846]: startup-script-url:
google_metadata_script_runner[846]: startup-script-url: 00:00.0 0600: 8086:1237 (rev 02)
google_metadata_script_runner[846]: startup-script-url: 00:01.0 0601: 8086:7110 (rev 03)
google_metadata_script_runner[846]: startup-script-url: 00:01.3 0680: 8086:7113 (rev 03)
google_metadata_script_runner[846]: startup-script-url: 00:03.0 0000: 1af4:1004
google_metadata_script_runner[846]: startup-script-url: 00:04.0 0200: 1af4:1000
google_metadata_script_runner[846]: startup-script-url: 00:05.0 00ff: 1af4:1005
google_metadata_script_runner[846]: startup-script-url:
google_metadata_script_runner[846]: startup-script-url: There doesn't seem to be a GPU unit connected to your system. Aborting drivers installation.
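
The `lspci -n` listing above shows only vendor IDs 8086 and 1af4, with no entry for NVIDIA's PCI vendor ID `10de`, which is consistent with the installer's message. The following Groovy sketch illustrates that kind of check; the helper name `hasNvidiaDevice` is hypothetical, and this is not the installer's actual logic, which is not shown in this thread.

```groovy
// Sketch: does `lspci -n` output list any NVIDIA device (PCI vendor ID 10de)?
// hasNvidiaDevice is a hypothetical helper, not part of Nextflow or the Google installer.
def hasNvidiaDevice(String lspciOutput) {
    lspciOutput.readLines().any { line ->
        // Each line looks like "00:04.0 0200: 1af4:1000"; the third field is vendor:device.
        def fields = line.trim().split(/\s+/)
        fields.size() >= 3 && fields[2].toLowerCase().startsWith('10de:')
    }
}

// Device lines copied from the journalctl excerpt above.
def fromLog = '''\
00:00.0 0600: 8086:1237 (rev 02)
00:01.0 0601: 8086:7110 (rev 03)
00:01.3 0680: 8086:7113 (rev 03)
00:03.0 0000: 1af4:1004
00:04.0 0200: 1af4:1000
00:05.0 00ff: 1af4:1005
'''

assert !hasNvidiaDevice(fromLog)

// Hypothetical line for a VM where a GPU is attached (device ID chosen for illustration).
assert hasNvidiaDevice('00:04.0 0302: 10de:1eb8 (rev a1)')
```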

@siddharthab
Contributor Author

@bentsherman The test failures seem unrelated to this change. Please let me know if you would like to see anything more in this PR.

@siddharthab
Contributor Author

@bentsherman Any thoughts?

@siddharthab
Contributor Author

@pditommaso Bringing it to your attention in case you missed this PR. It is now waiting for your final approval.

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
@pditommaso
Member

Testing against platform showcase pipelines

@pditommaso
Member

OK, some tests fail because credentials are not added to foreign branches. Let's merge anyway.

@pditommaso pditommaso merged commit 420fb17 into nextflow-io:master Feb 13, 2025
17 of 20 checks passed
Development

Successfully merging this pull request may close these issues.

Support accelerator optimized VMs in Google Cloud