[FEAT] [ROCm] Add ROCm support to fastsafetensors #34
Merged: takeshi-yoshimura merged 48 commits into foundation-model-stack:main from EmbeddedLLM:upstream on Nov 11, 2025.
Commits (48, all by tjtanaa):

- b844ac0 add rocm support
- 73b2674 add rocm documentation
- 663e8b6 add amd perf benchmark data
- 93e247e fix image
- 962010c address reviewer feedback
- f7e617d remove torch dependencies
- 72eb785 handle when cuda and rocm not found
- 6ddba67 fix paddle paddle
- 8714185 setup rocm wheel ci build
- 5fb3a14 manual trigger rocm workflow
- 5e2e4af remove system package steps
- 75530b5 install system dependencies
- 6ad6bb2 upgrade ubuntu version, skip tests
- ab00395 upgrade ubuntu version, skip tests
- 9a11ce3 install from python from offical source
- d6c1f4a use venv instead
- 5d24383 fix other python ci build
- 5eba4b5 build many linux
- c543e16 remove rocm_ tag from platform tag
- 8ea1956 Add automated PyPI index with separate CUDA/ROCm backends
- f6101fc only manual trigger when deploying pypi index
- 643d12d add publish to index GA workflow
- 0935d8c fix the publish to index
- 378b262 fix the publish to index write permission
- 5efe4a9 fix the publish to index for both mode
- 8871043 fix the manylinux rocmwheel build
- 619c531 update publish to index
- 6a3f302 fix nested dumb-pypi
- c3a580b fix publish to index
- 4eca2b7 fix the dumb-pypi
- 2f15898 fix the package path
- 8ffcefa lint
- f09cb54 remove deploy-pypi-index
- 821fb41 remove unused code
- e9ff27f update publish to index to handle version isolation
- eff48c7 fix publish to index yaml syntax
- 4dcbb76 fix publish to index syntax error
- 86faab4 add workflow python script
- 2b12a97 only bundle the dependencies specified in the pyproject.toml
- f0ec845 bugfix the workflow
- 5291031 fixing the publsih to index workflow
- 1f2e60c fixing the publsih to index workflow
- bf274a1 update workflow instruction
- c5b4886 only allow publish to index be triggered manually
- d631e44 remove github workflow
- 8832411 sync with upstream
- e353538 update installation procedure on ROCm
- 8fd8b99 fix installation command
New file added (+88 lines):
# Performance of FastSafeTensors on AMD GPUs

## DeepSeek-R1 vLLM Model Weight Loading Speed

This benchmark compares the performance of `safetensors` vs `fastsafetensors` when loading model weights on AMD GPUs.

NOTE: `fastsafetensors` does not support the GDS feature on ROCm, as there is no GDS alternative on ROCm.
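Because GDS has no ROCm equivalent, loads on AMD GPUs always take the `nogds` path. The sketch below illustrates what a direct `nogds` load looks like; it follows the `SafeTensorsFileLoader` usage from the project README, but the file path and tensor name are placeholders, and the exact signature should be checked against the installed version.

```python
# Minimal sketch of a nogds load on ROCm, based on the fastsafetensors README.
# The file path and tensor name are placeholders; verify the loader signature
# against your installed fastsafetensors version.
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0")  # ROCm builds of PyTorch still use "cuda" device strings
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True, debug_log=True)
loader.add_filenames({0: ["model-00001-of-00002.safetensors"]})  # {rank: [files]}
try:
    fb = loader.copy_files_to_device()  # multithreaded async reads, then copy to GPU
    tensor = fb.get_tensor("model.embed_tokens.weight")  # hypothetical tensor name
    print(tensor.shape, tensor.dtype)
    fb.close()
finally:
    loader.close()
```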
### Benchmark Methodology

**Platform:** AMD ROCm 7.0.1
**GPUs:** 8x AMD Instinct MI300X
**Library:** fastsafetensors 0.1.15

1. **Clear system cache** to ensure consistent starting conditions:

   ```bash
   sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
   ```

2. **Launch vLLM** with either `--load-format safetensors` or `--load-format fastsafetensors`:

   ```bash
   MODEL=EmbeddedLLM/deepseek-r1-FP8-Dynamic

   VLLM_USE_V1=1 \
   VLLM_ROCM_USE_AITER=1 \
   vllm serve $MODEL \
     --tensor-parallel-size 8 \
     --disable-log-requests \
     --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
     --trust-remote-code \
     --load-format fastsafetensors \
     --block-size 1
   ```
### Results

The experiments were carried out on MI300X.

**Cache Scenarios:**
- **No cache**: Model weights are loaded after clearing the system cache (cold start).
- **Cached**: Model weights are loaded immediately after a previous load. The weights are cached in the filesystem and RAM (warm start).

<img src="./images/fastsafetensors-rocm.png" alt="FastSafeTensors on ROCm" width="70%">
## GPT-2 perf tests based on the script [perf/fastsafetensors_perf/perf.py](../perf/fastsafetensors_perf/perf.py)

### Test Configuration

All tests were performed on single-GPU loading scenarios with two different model sizes:
- **GPT-2 (small):** 523MB safetensors file
- **GPT-2 Medium:** ~1.4GB safetensors file

#### Key Parameters Tested:
- **nogds mode:** ROCm fallback (GDS not available on AMD GPUs)
- **Thread counts:** 8, 16, 32
- **Buffer sizes:** 8MB, 16MB, 32MB
- **Loading methods:** nogds (async I/O), mmap (memory-mapped)
- **Data types:** AUTO (no conversion), F16 (half precision conversion)

---
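The bandwidth figures in the tables below are consistent with file size divided by elapsed load time. As a rough illustration of how a single measurement can be taken, here is a hand-written sketch (not the actual `perf.py` harness; the path is a placeholder and the loader signature is as assumed above):

```python
# Illustrative timing harness (not the actual perf.py): times one load and
# reports bandwidth as file_size / elapsed.
import os
import time
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

def time_load(path: str, nogds: bool = True) -> None:
    size_gb = os.path.getsize(path) / 1e9
    loader = SafeTensorsFileLoader(SingleGroup(), torch.device("cuda:0"), nogds=nogds)
    loader.add_filenames({0: [path]})
    start = time.perf_counter()
    fb = loader.copy_files_to_device()
    torch.cuda.synchronize()  # ensure device copies have finished before stopping the clock
    elapsed = time.perf_counter() - start
    print(f"{size_gb / elapsed:.2f} GB/s over {elapsed:.3f}s")
    fb.close()
    loader.close()

# Cold-start runs require dropping the page cache first, e.g.:
#   sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
time_load("gpt2.safetensors")  # placeholder path
```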
#### Performance Results

##### GPT-2 (523MB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Config | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|--------|-----------|--------------|-------|
| 1 | nogds | 16 | 16MB | default | **1.91 GB/s** | 0.268s | Baseline test |
| 2 | nogds | 32 | 32MB | default | **2.07 GB/s** | 0.246s | Higher threads/buffer |
| 3 | nogds | 8 | 8MB | default | **2.10 GB/s** | 0.243s | Lower threads/buffer |
| 4 | mmap | N/A | N/A | default | **1.01 GB/s** | 0.505s | Memory-mapped |
| 5 | nogds | 32 | 32MB | cache-drop | **1.24 GB/s** | 0.410s | Cold cache test |
| 6 | nogds | 32 | 32MB | F16 dtype | **0.77 GB/s** | 0.332s | With type conversion |
| 8 | nogds | 16 | 16MB | **optimal** | **2.62 GB/s** | 0.195s | Best config |

##### GPT-2 Medium (1.4GB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Block Size | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|------------|-----------|--------------|-------|
| 9 | nogds | 16 | 16MB | 160MB | **6.02 GB/s** | 0.235s | Optimal config |
| 10 | mmap | N/A | N/A | N/A | **1.28 GB/s** | 1.104s | Memory-mapped |
| 11 | nogds | 32 | 32MB | 160MB | **5.34 GB/s** | 0.265s | Higher threads |

---
New file added (+37 lines):
```c
// SPDX-License-Identifier: Apache-2.0
/*
 * CUDA/HIP compatibility layer for fastsafetensors
 * Minimal compatibility header - only defines what hipify-perl doesn't handle
 */

#ifndef __CUDA_COMPAT_H__
#define __CUDA_COMPAT_H__

// Platform detection - this gets hipified to check __HIP_PLATFORM_AMD__
#ifdef __HIP_PLATFORM_AMD__
#ifndef USE_ROCM
#define USE_ROCM
#endif
// Note: We do NOT include <hip/hip_runtime.h> here to avoid compile-time dependencies.
// Instead, we dynamically load the ROCm runtime library (libamdhip64.so) at runtime
// using dlopen(), just like we do for CUDA (libcudart.so).
// Minimal types are defined in ext.hpp.
#else
// For CUDA platform, we also avoid including headers and define minimal types in ext.hpp
#endif

// Runtime library name - hipify-perl doesn't change string literals
#ifdef USE_ROCM
#define GPU_RUNTIME_LIB "libamdhip64.so"
#else
#define GPU_RUNTIME_LIB "libcudart.so"
#endif

// Custom function pointer names that hipify-perl doesn't recognize
// These are our own naming in ext_funcs struct, not standard CUDA API
#ifdef USE_ROCM
#define cudaDeviceMalloc hipDeviceMalloc
#define cudaDeviceFree hipDeviceFree
#endif

#endif // __CUDA_COMPAT_H__
```
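The comments in this header describe the dynamic-loading approach: rather than linking against a GPU runtime at build time, the extension dlopen()s `GPU_RUNTIME_LIB` and resolves symbols at runtime. The sketch below demonstrates the same idea using Python's `ctypes` as a stand-in for dlopen(). Note that the real extension selects one library at compile time via `GPU_RUNTIME_LIB`; probing both runtimes in turn, as done here, is purely for illustration.

```python
# Sketch of the runtime-loading strategy described above, with ctypes standing
# in for dlopen(): try the ROCm runtime first, then fall back to CUDA.
import ctypes

def load_gpu_runtime():
    for libname, count_fn in (("libamdhip64.so", "hipGetDeviceCount"),
                              ("libcudart.so", "cudaGetDeviceCount")):
        try:
            lib = ctypes.CDLL(libname)  # equivalent of dlopen(libname, RTLD_NOW)
        except OSError:
            continue  # runtime not installed; try the next one
        count = ctypes.c_int(0)
        # Both entry points return an error code (0 on success) and write the
        # device count through an int* out-parameter.
        if getattr(lib, count_fn)(ctypes.byref(count)) == 0 and count.value > 0:
            return libname, lib
    return None, None

name, runtime = load_gpu_runtime()
print(f"GPU runtime: {name or 'not found'}")
```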
Review comment: I know those kinds of macros enable minimal changes when switching to the HIP APIs, but they will likely cause confusion. We could change the callback registration at `load_nvidia_functions`, but it should then be renamed to `load_gpu_functions` or another appropriate name, which would potentially cause even more renames... I will work on refactoring this later, so please just keep them.