GC less effective in AMDGPU than CUDA #683
Comments
Thanks for reporting. It would be interesting to profile further using …
Update: if in the same code …
Closing this, as we now have a caching allocator which avoids the GC and allows for fast reuse of allocations:

```julia
cache = GPUArrays.AllocCache()
@btime AMDGPU.@sync GPUArrays.@cached $cache fn(...)
GPUArrays.unsafe_free!(cache)
```
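For reference, a minimal self-contained sketch of that pattern, assuming a simple matmul-on-copies workload (`fn`, the matrix sizes, and the use of `*` are illustrative assumptions, not taken from the original issue):

```julia
using AMDGPU, GPUArrays, BenchmarkTools

# Illustrative workload: allocate temporaries and multiply, as in the issue.
function fn(A, B)
    Acpy = copy(A)        # temporary device allocations
    Bcpy = copy(B)
    return Acpy * Bcpy    # dispatches to rocBLAS for ROCArrays
end

A = ROCArray(rand(Float32, 512, 512))   # sizes are an assumption
B = ROCArray(rand(Float32, 512, 512))

# Reuse device allocations across calls instead of round-tripping through the GC.
cache = GPUArrays.AllocCache()
@btime AMDGPU.@sync GPUArrays.@cached $cache fn($A, $B)
GPUArrays.unsafe_free!(cache)   # release the cached device memory when done
```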
Creating a multitude of small copies for benchmarking slows AMDGPU.jl down a lot, something not observed in CUDA.jl. The solution for this specific code is to avoid allocations altogether, but this is (maybe?) not possible with every type of code. (I also remember having had some issues with BenchmarkTools.jl, but cannot manage to reproduce them right now.) Sharing the code here for future reference:
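A minimal sketch of the benchmark pattern described, assuming square Float32 matrices and substituting LinearAlgebra.mul! for the rocBLAS gemm call (sizes and setup are assumptions, not the original listing):

```julia
using AMDGPU, LinearAlgebra, BenchmarkTools

A = ROCArray(rand(Float32, 256, 256))   # sizes are an assumption
B = ROCArray(rand(Float32, 256, 256))
C = similar(A)

# Fast variant: no per-iteration device allocations.
@btime AMDGPU.@sync mul!($C, $A, $B)

# Slow variant on AMDGPU.jl: two small device copies per iteration,
# which is what triggers the GC pressure described in this issue.
@btime AMDGPU.@sync begin
    Acpy = copy($A)
    Bcpy = copy($B)
    mul!($C, Acpy, Bcpy)
end
```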
Adding AMDGPU.unsafe_free! in every iteration does not solve this problem either; neither does turning GC off and manually running

```julia
GC.enable(true); AMDGPU.unsafe_free!(Acpy); AMDGPU.unsafe_free!(Bcpy); GC.gc(); sleep(0.001); GC.enable(false)
```

between every iteration. The same code with AMDGPU replaced by CUDA (and the ROCblasgemm call by Acpy * Bcpy) shows barely any performance difference between the two variants (even slightly better and more stable performance when using copies).

Versions:
@jpsamaroo @vchuravy @pxl-th