valenzuelaomar
Collaborator

@valenzuelaomar valenzuelaomar commented May 2, 2025

Resolves #238

This pull request introduces GPU acceleration support for tasks using cuML, along with related updates to the codebase and documentation. The changes include adding installation instructions, updating dependencies, and enabling GPU acceleration if available.

GPU Acceleration Support:

  • Documentation Update: Added a section in README.md explaining how to install and enable GPU acceleration using cuML for improved performance. Includes installation steps for both source and PyPI installations.
  • Optional Dependency: Added a new optional dependency group gpu in pyproject.toml that includes cuml-cu12==25.4.*.
  • GPU Initialization: Introduced _enable_gpu_acceleration() in src/czbenchmarks/__init__.py, which initializes GPU acceleration with cuML if it is available and logs the status of GPU support.

Benchmarks (regular sklearn vs gpu-accelerated sklearn)

[Image: benchmark chart from https://developer.nvidia.com/blog/nvidia-cuml-brings-zero-code-change-acceleration-to-scikit-learn/#benchmarks]

How this works

image

Tests

Ran with regular installation

uv pip install -e .
czbenchmarks run --models SCGPT --scgpt-model-variant human --datasets tsv2_bladder --tasks clustering --clustering-task-label-key cell_type

Ran with regular installation + GPU acceleration

uv pip install -e ".[gpu]"
czbenchmarks run --models SCGPT --scgpt-model-variant human --datasets tsv2_bladder --tasks clustering --clustering-task-label-key cell_type

Known Limitations

cuML automatically accelerates compatible components on NVIDIA GPUs and falls back to CPU execution for unsupported operations.
https://docs.rapids.ai/api/cuml/stable/zero-code-change-limitations/

@valenzuelaomar valenzuelaomar changed the title from "feat: speedup tasks by >50x with cuML" to "feat: speedup tasks runtime with cuML" May 2, 2025
@valenzuelaomar valenzuelaomar changed the title from "feat: speedup tasks runtime with cuML" to "feat: speedup task runtime with cuML" May 2, 2025
@steveherrin
Contributor

steveherrin commented May 6, 2025

For what it's worth, this did not work for me. It errored out on the import cuml.accel line with an OSError because it could not find libcudart.so (the code only catches ImportError). I had to run sudo apt install nvidia-cuda-toolkit (on Ubuntu) to get it to work properly.
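
One way to make the initialization tolerate a missing CUDA runtime is to catch OSError alongside ImportError. The sketch below is an assumption about how that could be done, not the fix adopted in this PR. The name try_enable_accel and the injectable load parameter are hypothetical and exist only so the failure paths can be exercised without a GPU.

```python
import importlib
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def _load_cuml_accel() -> object:
    # Importing cuml.accel can raise ImportError (package missing) or
    # OSError (a shared library such as libcudart.so cannot be loaded).
    return importlib.import_module("cuml.accel")


def try_enable_accel(load: Callable[[], object] = _load_cuml_accel) -> bool:
    """Enable the accelerator if possible; fall back to CPU otherwise."""
    try:
        accel = load()
        accel.install()
    except ImportError:
        logger.info("cuML not installed; running on CPU.")
        return False
    except OSError as exc:
        # Typically a missing CUDA runtime; installing the CUDA toolkit
        # resolved it in the report above.
        logger.warning("cuML found but could not be loaded (%s); running on CPU.", exc)
        return False
    return True
```

Logging the OSError at warning level keeps the failure visible, since in this case the user asked for the GPU extra and did not get it.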

@valenzuelaomar
Collaborator Author

Thanks for reporting! In the README.md I did specify installing nvidia-cuda-toolkit, but not in the PR description, so I apologize for that.

@steveherrin
Contributor

Oops, I see it now. I somehow missed it 😓

@steveherrin
Contributor

steveherrin commented May 6, 2025

Running the labeling task using cached embeddings for one model (UCE 4l) on one tissue and with --set-baseline, on a g4dn.8xlarge instance, with the acceleration:

real    47m9.206s
user    43m26.473s
sys     4m3.370s

without the acceleration (so without these changes):

real    40m43.173s
user    198m47.306s
sys     1m51.516s

so no real speedup
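
To put numbers on the comparison, the two timing reports can be converted to seconds and compared. This is a small sketch using only the figures reported above; the helper name to_seconds is introduced here for illustration.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '47m9.206s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Figures copied from the two runs reported above.
with_accel = {"real": "47m9.206s", "user": "43m26.473s", "sys": "4m3.370s"}
without_accel = {"real": "40m43.173s", "user": "198m47.306s", "sys": "1m51.516s"}

# Wall-clock time: accelerated run relative to the CPU-only run.
wall_ratio = to_seconds(with_accel["real"]) / to_seconds(without_accel["real"])

# User CPU time: CPU-only run relative to the accelerated run.
user_ratio = to_seconds(without_accel["user"]) / to_seconds(with_accel["user"])

print(f"wall-clock ratio (accelerated / CPU-only): {wall_ratio:.2f}")
print(f"user CPU ratio (CPU-only / accelerated): {user_ratio:.2f}")
```

The accelerated run took roughly 16% longer in wall-clock time while using roughly 4.6x less user CPU time. The CPU-only run's user time is several times its wall-clock time, which indicates it was already spreading work across multiple cores.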

@mlgill
Collaborator

mlgill commented May 7, 2025

I haven't tried the cuML accelerator yet, so I don't know specifics. But in general, there needs to be a large amount of data to offset the time spent moving data on and off the GPU. My guess is that UCE-4l embeddings aren't large enough. UCE-33l embeddings might show acceleration with GPU.

The speedup also depends on the GPU and on the specific ML algorithm.

@valenzuelaomar Maybe the documentation should be updated to indicate that GPU acceleration can vary based on the amount of data, the algorithm, and the type of GPU?
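
The trade-off described here can be illustrated with a toy cost model: the GPU path pays a fixed overhead (data transfer and setup) and then a lower per-row cost, so it only wins above some dataset size. Everything in this sketch is hypothetical, including the function name and the example numbers, which are placeholders rather than measurements from czbenchmarks or cuML.

```python
from typing import Optional


def break_even_rows(
    gpu_overhead_s: float,
    cpu_s_per_row: float,
    gpu_s_per_row: float,
) -> Optional[float]:
    """Rows at which GPU time equals CPU time under a linear cost model.

    CPU time: n * cpu_s_per_row
    GPU time: gpu_overhead_s + n * gpu_s_per_row
    Returns None if the GPU is never faster per row.
    """
    if gpu_s_per_row >= cpu_s_per_row:
        return None
    return gpu_overhead_s / (cpu_s_per_row - gpu_s_per_row)


# Placeholder values chosen only to show the shape of the trade-off.
rows = break_even_rows(gpu_overhead_s=2.0, cpu_s_per_row=1e-4, gpu_s_per_row=1e-5)
print(f"GPU pays off above about {rows:,.0f} rows under these assumed costs")
```

Real workloads are not this linear, and the per-row costs differ by algorithm and GPU, so the model only shows why small inputs may see no benefit.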

@valenzuelaomar
Collaborator Author

@mlgill you're spot on that a large amount of data is needed to see the benefits of GPU acceleration. I think updating the documentation to mention that is a good idea.

Development

Successfully merging this pull request may close these issues.

Add cuML acceleration for task computations
5 participants