GC less effective in AMDGPU than CUDA #683
Comments
Thanks for reporting. It would be interesting to profile further using …
Update: if in the same code …
Closing this, as we now have a caching allocator which avoids the GC and allows for fast reuse of allocations:

```julia
cache = GPUArrays.AllocCache()
@btime AMDGPU.@sync GPUArrays.@cached $cache fn(...)
GPUArrays.unsafe_free!(cache)
```
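For reference, a minimal self-contained sketch of that pattern, assuming a simple matmul-on-copies workload (`fn`, the matrix sizes, and the use of `*` are illustrative assumptions, not taken from the original issue):

```julia
using AMDGPU, GPUArrays, BenchmarkTools

# Illustrative workload: allocate temporaries and multiply, as in the issue.
function fn(A, B)
    Acpy = copy(A)        # temporary device allocations
    Bcpy = copy(B)
    return Acpy * Bcpy    # dispatches to rocBLAS for ROCArrays
end

A = ROCArray(rand(Float32, 512, 512))   # sizes are an assumption
B = ROCArray(rand(Float32, 512, 512))

# Reuse device allocations across calls instead of round-tripping through the GC.
cache = GPUArrays.AllocCache()
@btime AMDGPU.@sync GPUArrays.@cached $cache fn($A, $B)
GPUArrays.unsafe_free!(cache)   # release the cached device memory when done
```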
Creating a multitude of small copies for benchmarking slows AMDGPU.jl down a lot, something not observed in CUDA.jl. The solution for this specific code is to avoid allocations altogether, but this is (maybe?) not possible with every type of code. (I also remember having had some issues with BenchmarkTools.jl, but cannot manage to reproduce them right now.) Sharing the code here for future reference:
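A minimal sketch of the benchmark pattern described, assuming square Float32 matrices and substituting LinearAlgebra.mul! for the rocBLAS gemm call (sizes and setup are assumptions, not the original listing):

```julia
using AMDGPU, LinearAlgebra, BenchmarkTools

A = ROCArray(rand(Float32, 256, 256))   # sizes are an assumption
B = ROCArray(rand(Float32, 256, 256))
C = similar(A)

# Fast variant: no per-iteration device allocations.
@btime AMDGPU.@sync mul!($C, $A, $B)

# Slow variant on AMDGPU.jl: two small device copies per iteration,
# which is what triggers the GC pressure described in this issue.
@btime AMDGPU.@sync begin
    Acpy = copy($A)
    Bcpy = copy($B)
    mul!($C, Acpy, Bcpy)
end
```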
Adding AMDGPU.unsafe_free! in every iteration does not solve this problem either; neither does turning GC off and manually running

```julia
GC.enable(true); AMDGPU.unsafe_free!(Acpy); AMDGPU.unsafe_free!(Bcpy); GC.gc(); sleep(0.001); GC.enable(false)
```

between every iteration. The same code with AMDGPU replaced by CUDA (and the ROCblasgemm call by Acpy * Bcpy) shows barely any performance difference between the two variants (even slightly better and more stable performance when using copies).

Versions:
@jpsamaroo @vchuravy @pxl-th