[FEAT] [ROCm] Add ROCm support to fastsafetensors #34
Conversation
Excellent work! I will review your changes later this week or early next week. Of course, I can help with the blog post!
@takeshi-yoshimura |
My major comments are twofold.
- Fastsafetensors dynamically loads shared libraries at run time, not at compile time. I want you to try doing the same with the ROCm libraries. If that is possible, general ROCm users can avoid building the code themselves and can just use downloaded pip packages.
- Please update and use `get_cuda_ver()` in `FrameworkOpBase` instead of `torch.cuda.is_available()`.
fastsafetensors/cpp/cuda_compat.h
Outdated
```diff
@@ -0,0 +1,33 @@
/*
 * Copyright 2024 IBM Inc. All rights reserved
```
I decided not to display the copyright on the first line. So, please start with `SPDX-License-Identifier: Apache-2.0` for a new file.
Oh, maybe my comment was confusing. Please delete the line for IBM.
```cpp
// Custom function pointer names that hipify-perl doesn't recognize
// These are our own naming in the ext_funcs struct, not standard CUDA API
#ifdef USE_ROCM
#define cudaDeviceMalloc hipDeviceMalloc
```
I know those kinds of macros enable minimal changes for switching to HIP APIs, but this will likely cause confusion. We could change the callback registration at `load_nvidia_functions`, but it would then need to be renamed to `load_gpu_functions` or another appropriate name, potentially causing even more renames... I will work on refactoring this later, so please just keep them.
fastsafetensors/dlpack.py
Outdated
| """Detect if we're running on ROCm or CUDA""" | ||
| try: | ||
| import torch | ||
| if torch.cuda.is_available(): |
Is it possible to update `get_cuda_ver` in `frameworks/_torch.py` and use it instead of calling torch directly? We support frameworks other than torch, and this change may cause breakage on them. You can add a suffix to distinguish HIP from CUDA.
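A minimal sketch of what such a suffixed `get_cuda_ver` could look like for the torch backend — an illustration of this suggestion, not the merged code; the `"0.0"` CPU fallback follows the behavior settled on later in this thread:

```python
import torch

def get_cuda_ver() -> str:
    # torch.version.hip is set on ROCm builds of PyTorch and is None on CUDA
    # builds (and vice versa for torch.version.cuda), so a prefix lets callers
    # tell the platforms apart without calling torch.cuda.is_available() directly.
    if torch.version.hip:
        return f"hip-{torch.version.hip}"
    if torch.version.cuda and torch.cuda.is_available():
        return f"cuda-{torch.version.cuda}"
    return "0.0"  # CPU-only fallback, matching the reverted behavior below
```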
```python
def MyExtension(name, sources, mod_name, *args, **kwargs):
    def detect_platform():
```
It is okay to have this for the moment, but it is not useful for general users who rebuild the code by hand. I hope to see a simplified setup.py with unified binaries that dynamically switch runtimes.
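For illustration only, a hypothetical `detect_platform()` along the lines discussed here might probe the environment rather than require separate builds. The `ROCM_PATH`/`CUDA_HOME` and `hipcc`/`nvcc` checks are assumptions, not the PR's actual logic:

```python
import os
import shutil

def detect_platform() -> str:
    # Treat a ROCm toolchain on PATH or a set ROCM_PATH as a ROCm signal,
    # and nvcc/CUDA_HOME as a CUDA signal; otherwise assume CPU-only.
    if os.environ.get("ROCM_PATH") or shutil.which("hipcc"):
        return "rocm"
    if os.environ.get("CUDA_HOME") or shutil.which("nvcc"):
        return "cuda"
    return "cpu"
```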
fastsafetensors/cpp/cuda_compat.h
Outdated
```cpp
#ifndef USE_ROCM
#define USE_ROCM
#endif
#include <hip/hip_runtime.h>
```
I am not sure if you noticed, but fastsafetensors does not include CUDA headers, to avoid CUDA dependencies (no -lcudart or other link options in setup.py). It dynamically searches for the shared library (i.e., libcudart.so) and loads symbols on demand with dlopen. I guess this header include is not required if you follow this procedure, but have you tried yet? If my guess is correct, you can simplify setup.py and general users can even use ROCm via `pip install fastsafetensors` instead of building the code themselves.
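To illustrate the idea — the actual loader lives in the C++ extension and uses dlopen directly; this Python/ctypes sketch and the HIP runtime sonames are assumptions:

```python
import ctypes
import ctypes.util

def load_hip_runtime():
    # Try common sonames for the ROCm HIP runtime, then fall back to the
    # system search path, mirroring how fastsafetensors locates libcudart.so
    # at run time instead of linking against it at build time.
    for name in ("libamdhip64.so", "libamdhip64.so.6", "libamdhip64.so.5"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    path = ctypes.util.find_library("amdhip64")
    return ctypes.CDLL(path) if path else None

hip = load_hip_runtime()
if hip is not None:
    # Symbols are resolved on demand; hipGetDeviceCount(int*) mirrors
    # cudaGetDeviceCount, so no HIP headers are needed when building.
    count = ctypes.c_int(0)
    hip.hipGetDeviceCount(ctypes.byref(count))
    print("HIP devices:", count.value)
```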
tests/platform_utils.py
Outdated
```python
def is_rocm_platform():
    """Detect if running on ROCm/AMD platform."""
    try:
        import torch
```
Please do not use framework-specific calls, as I commented in another file.
One more request: please make sure the code passes lint.
@takeshi-yoshimura Can you take another look? I have addressed your comments.
takeshi-yoshimura
left a comment
Sorry I could not respond promptly. Regarding the changes around setup.py, I need to learn more about hipify (e.g., I still don't understand how it converts dynamic function pointers), but it seems not to influence the original build process. So let's go ahead with your proposal for now; maybe I will simplify it later if possible.
Before merging your changes, please make sure all the tests pass. You can find the test logs in the last artifact of each run on the Actions page.
```
________________________________ test_framework ________________________________
fstcpp_log = None
framework = <fastsafetensors.frameworks._paddle.PaddleOp object at 0x7f3e0fdd65b0>

    def test_framework(fstcpp_log, framework) -> None:
        t = framework.get_empty_tensor([1], DType.F16, Device.from_str("cpu"))
        with pytest.raises(Exception):
            framework.is_equal(t, [float(0.0)])
        with pytest.raises(Exception):
            framework.get_process_group(int(0))
        # Test that get_cuda_ver() returns a string with platform prefix
        cuda_ver = framework.get_cuda_ver()
        assert isinstance(cuda_ver, str)
        # Should be "hip-X.Y.Z", "cuda-X.Y", or "none"
>       assert (
            cuda_ver.startswith("hip-")
            or cuda_ver.startswith("cuda-")
            or cuda_ver == "none"
        )
E       AssertionError: assert (False or False or '0.0' == 'none')
E        +  where False = <built-in method startswith of str object at 0x7f3e0e39db70>('hip-')
E        +  where <built-in method startswith of str object at 0x7f3e0e39db70> = '0.0'.startswith
E        +  and False = <built-in method startswith of str object at 0x7f3e0e39db70>('cuda-')
E        +  where <built-in method startswith of str object at 0x7f3e0e39db70> = '0.0'.startswith
E
E         - none
E         + 0.0

test_fastsafetensors.py:117: AssertionError
```
@takeshi-yoshimura It is ready for review. I have fixed all the unit tests; both the torch and paddle unit tests are passing. I have set up two new GitHub Actions. The built wheels can be installed with:

pip install fastsafetensors --index-url https://embeddedllm.github.io/fastsafetensors-rocm/rocm/simple/
pip install fastsafetensors --index-url https://embeddedllm.github.io/fastsafetensors-rocm/cuda/simple/

This is the webpage for the proof of concept: https://embeddedllm.github.io/fastsafetensors-rocm/ . We could work together to make sure users can:

pip install fastsafetensors --index-url https://foundation-model-stack.github.io/fastsafetensors/rocm/simple/
pip install fastsafetensors --index-url https://foundation-model-stack.github.io/fastsafetensors/cuda/simple/
- Add script to auto-generate PyPI-compatible index from GitHub releases
- Create separate indexes for CUDA and ROCm wheels based on release tags
- Add GitHub Actions workflow to deploy indexes to GitHub Pages
- Generate landing page with installation instructions
- Ignore auto-generated pypi-index directory

Indexes will be available at:
- ROCm: https://embeddedllm.github.io/fastsafetensors-rocm/rocm/simple/
- CUDA: https://embeddedllm.github.io/fastsafetensors-rocm/cuda/simple/
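As a hedged illustration of what the index-generation script described in this commit might do (hypothetical and simplified; the real script lives in the PR), a PEP 503 "simple" index is just an HTML page of wheel links that pip can resolve via `--index-url`:

```python
from pathlib import Path

def write_simple_index(wheel_dir: Path, out_dir: Path,
                       package: str = "fastsafetensors") -> None:
    # Emit a PEP 503 "simple" index page linking every wheel in wheel_dir,
    # under out_dir/simple/<package>/index.html, so pip can discover them.
    links = "\n".join(
        f'<a href="{w.name}">{w.name}</a><br/>'
        for w in sorted(wheel_dir.glob("*.whl"))
    )
    page = f"<!DOCTYPE html>\n<html><body>\n{links}\n</body></html>\n"
    index_dir = out_dir / "simple" / package
    index_dir.mkdir(parents=True, exist_ok=True)
    (index_dir / "index.html").write_text(page)
```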
```python
between different GPU platforms without using torch directly.
"""
if torch.cuda.is_available():
    return str(torch.version.cuda)
```
@takeshi-yoshimura This is the part that differs from the previous behavior, and it is what is causing the unit-test failures. Let me know if it is an issue. Thank you.
Reverted the behavior: when on CPU, it returns the version string "0.0".
```python
return (
    str(paddle.version.cuda())
    if paddle.device.is_compiled_with_cuda()
    else "0.0"
```
@takeshi-yoshimura Same here: this part differs from the previous behavior and is what was causing the unit-test failures. Let me know if it is an issue. Thank you.
Reverted the behavior: when on CPU, it returns the version string "0.0".
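For symmetry with the torch sketch earlier in the thread, a hedged illustration of how the paddle branch could report a platform-prefixed version. This assumes `paddle.device.is_compiled_with_rocm()` is available and uses `paddle.version.cuda()` as a placeholder version source; it is not the merged code:

```python
import paddle

def get_cuda_ver() -> str:
    # Prefix the reported version so callers can distinguish platforms;
    # fall back to "0.0" on CPU-only builds, matching the reverted behavior.
    # NOTE: the exact version source on ROCm builds of paddle may differ.
    if paddle.device.is_compiled_with_rocm():
        return f"hip-{paddle.version.cuda()}"
    if paddle.device.is_compiled_with_cuda():
        return f"cuda-{paddle.version.cuda()}"
    return "0.0"
```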
@takeshi-yoshimura

Creating Wheels for Releases
There are two GitHub Actions that have to be run to create the wheels.

Create Releases
The convention of the releases on GitHub should be followed. Remember to unzip the artifacts and upload the individual wheels to the GitHub release page.

(Optional) Create GitHub PyPI Index
If you have not set up GitHub Pages to host the GitHub PyPI index, follow the instructions in the repository.

Update the GitHub PyPI Index to Include the New Version Wheels
Trigger the workflow. When the GitHub Action is completed, you should be able to install through pip:

pip install fastsafetensors --index-url https://foundation-model-stack.github.io/fastsafetensors/rocm/simple/
pip install fastsafetensors --index-url https://foundation-model-stack.github.io/fastsafetensors/cuda/simple/
Thanks. I would like to add three requests:
@takeshi-yoshimura This PR is ready for review.
I have also updated the instructions, since we can now install directly from git+https://github.com/foundation-model-stack/fastsafetensors.git, like this:

pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
Merged commit 1de5f13 into foundation-model-stack:main.
Great work! Thanks!
Motivation
Loading large model weights has been a bottleneck in development. Moreover, more and more models do not have a smaller-sized counterpart for rapid prototyping and development. The cold-start time during deployment is also a huge downside.
We were lucky to have found this hidden gem, which also has great integration with vLLM.
Purpose
Add support for ROCm. Since at the moment there is no alternative to NVIDIA GDS, this enablement is without GDS support.
From the news on the ROCm 7.9 Tech Preview, there is a new library called hipFile. We hope to enable GDS support when it is released.

Performance
When loading DeepSeek-R1 weights with TP8, we saw a whopping 7.4x improvement compared to reading with safetensors from NVMe using vLLM.

NOTE: vLLM will by default try to load fastsafetensors with `gds=True`, and if `gds=True` is not supported it will fall back to `gds=False`. So there is no need to update vLLM code; it works right after installing this fastsafetensors ROCm support!
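For reference, a minimal way to exercise this path from the vLLM Python API, assuming your vLLM version ships the fastsafetensors load format (the model name is just a placeholder):

```python
from vllm import LLM

# Request the fastsafetensors loader explicitly; vLLM negotiates GDS
# support internally and falls back to gds=False where GDS is unavailable.
llm = LLM(model="facebook/opt-125m", load_format="fastsafetensors")
```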
Tests

It passes all of the core unit tests:
- tests/test_multi.py
- tests/test_fastsafetensors.py

I have also been heavily using it for my development on ROCm since enabling this feature.
Setup steps
I have updated the README.md. It is as easy as:

python3 setup.py develop

UPDATES on setup steps:
Install from GitHub source:

pip install git+https://github.com/foundation-model-stack/fastsafetensors.git

Install from source:

pip install .

@takeshi-yoshimura Are you interested in collaborating on a vLLM blog post at https://blog.vllm.ai/ showcasing how effective fastsafetensors is at improving cold-start efficiency?