In Google Batch, install GPU drivers for GPU VMs #5406

Open
wants to merge 5 commits into
base: master

siddharthab
Contributor

Also clean up some old logic for container options when using GPUs.
These are now automatically handled by Google Cloud.

Fixes #5372.

Signed-off-by: Siddhartha Bagaria [email protected]

netlify bot commented Oct 17, 2024

Deploy Preview for nextflow-docs-staging canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 9c13f7a |
| 🔍 Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/6745d4b54c78640008f592db |

@siddharthab siddharthab marked this pull request as ready for review October 17, 2024 00:28
Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Siddhartha Bagaria <[email protected]>
@bentsherman
Member

@siddharthab Overall, the changes make sense. I think we'll need to do some testing on our end to make sure the machine type selection still works with Fusion + SSD.

Can you confirm that it works for you with some GPU-enabled tasks?

Signed-off-by: Siddhartha Bagaria <[email protected]>
@siddharthab
Contributor Author

Yes, I confirm that I tested the following configurations:

| GPU Directive | bootDiskImage | Fusion | Result |
| --- | --- | --- | --- |
| machineType | batch-cos | No | PASS |
| accelerator | batch-cos | No | PASS |
| machineType | batch-cos | Yes | PASS† |
| accelerator | batch-cos | Yes | PASS |
| machineType | batch-debian | No | FAIL‡ |
| accelerator | batch-debian | Yes | FAIL‡ |

† Needed new logic in findValidLocalSSDSize for the larger machine types, now pushed as an update to this PR.
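
For readers unfamiliar with the problem, here is a rough Python sketch of the kind of rounding a function like findValidLocalSSDSize has to do. It is not the code in this PR. It assumes each local SSD disk is 375 GB and that a machine type only accepts certain disk counts; the function name `find_valid_local_ssd_size_gb` and the `allowed_counts` values are hypothetical and would have to come from the Google Cloud documentation for the machine type in question.

```python
import math

# One Google Cloud local SSD disk is 375 GB.
LOCAL_SSD_GB = 375


def find_valid_local_ssd_size_gb(requested_gb, allowed_counts):
    """Round a requested size up to the nearest valid local SSD total.

    `allowed_counts` is a hypothetical list of disk counts that the
    chosen machine type accepts; it is an input here, not looked up.
    """
    # Minimum number of 375 GB disks that covers the request.
    needed = max(1, math.ceil(requested_gb / LOCAL_SSD_GB))
    # Pick the smallest allowed count that is at least that large.
    for count in sorted(allowed_counts):
        if count >= needed:
            return count * LOCAL_SSD_GB
    raise ValueError(
        f"{requested_gb} GB exceeds the largest allowed local SSD configuration"
    )


if __name__ == "__main__":
    # Illustrative counts only; larger machine types may force a higher minimum.
    print(find_valid_local_ssd_size_gb(1000, [1, 2, 4, 8]))
    print(find_valid_local_ssd_size_gb(375, [4, 8]))
```

The second call shows why larger machine types need special handling: when the smallest allowed count is above one, even a small request has to be rounded up to that minimum.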

‡ It looks like installGpuDrivers works properly only when bootDiskImage is batch-cos (the default). With batch-debian, the installer aborted with the following in journalctl. This is independent of any changes made in this PR (mentioning here because it came up in my testing).

google_metadata_script_runner[846]: startup-script-url: [2024-10-09 19:40:44] Executing: lspci -n
google_metadata_script_runner[846]: startup-script-url:
google_metadata_script_runner[846]: startup-script-url: 00:00.0 0600: 8086:1237 (rev 02)
google_metadata_script_runner[846]: startup-script-url: 00:01.0 0601: 8086:7110 (rev 03)
google_metadata_script_runner[846]: startup-script-url: 00:01.3 0680: 8086:7113 (rev 03)
google_metadata_script_runner[846]: startup-script-url: 00:03.0 0000: 1af4:1004
google_metadata_script_runner[846]: startup-script-url: 00:04.0 0200: 1af4:1000
google_metadata_script_runner[846]: startup-script-url: 00:05.0 00ff: 1af4:1005
google_metadata_script_runner[846]: startup-script-url:
google_metadata_script_runner[846]: startup-script-url: There doesn't seem to be a GPU unit connected to your system. Aborting drivers installation.
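
The lspci output above lists only devices with vendor IDs 8086 (Intel) and 1af4 (virtio), and none with 10de (NVIDIA), which is consistent with the installer's message. A minimal Python sketch of such a check follows; the helper names are hypothetical, and the driver installer's actual detection logic is not shown in this thread, so treat this as an illustration of reading the log rather than a description of the script.

```python
# PCI vendor ID assigned to NVIDIA.
NVIDIA_VENDOR_ID = "10de"


def pci_vendor_ids(lspci_n_output):
    """Extract vendor IDs from `lspci -n` lines such as '00:04.0 0200: 1af4:1000'."""
    vendors = []
    for line in lspci_n_output.splitlines():
        parts = line.split()
        # Expected shape: <slot> <class>: <vendor>:<device> [(rev xx)]
        if len(parts) >= 3 and parts[1].endswith(":") and ":" in parts[2]:
            vendors.append(parts[2].split(":")[0].lower())
    return vendors


def has_nvidia_gpu(lspci_n_output):
    """Return True if any listed PCI device has the NVIDIA vendor ID."""
    return NVIDIA_VENDOR_ID in pci_vendor_ids(lspci_n_output)


# Device lines copied from the journalctl output, with the log prefix removed.
SAMPLE = """\
00:00.0 0600: 8086:1237 (rev 02)
00:01.0 0601: 8086:7110 (rev 03)
00:01.3 0680: 8086:7113 (rev 03)
00:03.0 0000: 1af4:1004
00:04.0 0200: 1af4:1000
00:05.0 00ff: 1af4:1005
"""

if __name__ == "__main__":
    print(pci_vendor_ids(SAMPLE))
    print(has_nvidia_gpu(SAMPLE))
```

Running the same check on a VM where the GPU is attached and visible would be a quick way to tell whether the failure is in device attachment or in the installer.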

@siddharthab
Contributor Author

@bentsherman The test failures seem unrelated to this change. Please let me know if you would like to see anything more in this PR.

@siddharthab
Contributor Author

@bentsherman Any thoughts?

Development

Successfully merging this pull request may close these issues.

Support accelerator optimized VMs in Google Cloud