
Performance example Parameter-Parallelism with GPU Ensemble Methods #235

Open

roflmaostc opened this issue Feb 14, 2023 · 3 comments

@roflmaostc
Hi,

just copy-pasting the example:

# ╔═╡ 0094c9ce-0bf7-40ff-b6df-dcddf94828cd
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# ╔═╡ 0c65ebbf-95b8-44eb-8789-8c03c29442d8
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

# ╔═╡ faf01799-9b68-463d-b439-504d78ebb2e6
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]

# ╔═╡ fd96be2f-eab0-4cf9-9680-8af8aa5fc54f
tspan = (0.0f0, 10.0f0)

# ╔═╡ 64bb027c-bc7b-443d-a712-8c2c9bf577ac
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]

# ╔═╡ 4b77d5b2-2988-4c2a-ab51-a424a016d85a
prob = ODEProblem{false}(lorenz, u0, tspan, p)

# ╔═╡ f5c3ee38-59b5-4840-9cea-ea2c1a52c512
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)

# ╔═╡ 169bb61d-8a59-4ad1-9936-d594ce5eb2de
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# ╔═╡ 097eeba0-3608-498f-9574-b257b6e9604a
@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

# ╔═╡ 5cccc317-125b-4c23-abcf-09292ab9c7e8
CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

I don't see an improvement with the GPU version.

# CPU
  0.182350 seconds (14.10 M allocations: 1.822 GiB)
# GPU
 0.249538 seconds (3.44 M CPU allocations: 382.376 MiB) (3 GPU allocations: 126.038 MiB, 0.01% memmgmt time)

System:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_REVISE_WORKER_ONLY = 1

NVIDIA GeForce RTX 3060 (GPU 0)

Is there anything wrong or is this expected?

Best,

Felix

@ChrisRackauckas (Member)

Your measurements look really weird. Are you sure that's what actually ran? Do it in the REPL, not in a Pluto notebook:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                    adaptive = false, dt = 0.1f0)
# 9.721686 seconds (13.30 M allocations: 1.775 GiB, 17.76% gc time)

DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
                  adaptive = false, dt = 0.1f0)

# 0.022129 seconds (255.12 k CPU allocations: 33.413 MiB) (3 GPU allocations: 12.604 MiB, 0.09% memmgmt time)

#=
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =
=#

with a 2080 Super. How your Ryzen 9 5900 is an order of magnitude faster and your 3060 is an order of magnitude slower is... weird to say the least.

@roflmaostc (Author) commented Feb 14, 2023

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,

Did you measure with different numbers of trajectories? Your CPU run uses 100_000, but the GPU run only 10_000.

I've been using Pluto, but here are my REPL results.

As far as I can see, it uses 12 threads. Is my processor on steroids?

julia> @time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                           adaptive = false, dt = 0.1f0)
  0.199818 seconds (13.10 M allocations: 1.772 GiB)

julia> DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                         adaptive = false, dt = 0.1f0)
  0.189582 seconds (2.44 M CPU allocations: 330.496 MiB) (3 GPU allocations: 126.038 MiB, 0.16% memmgmt time)
EnsembleSolution Solution of length 100000 with uType:
ODESolution{Float32, 2, uType, Nothing, Nothing, tType, Nothing, P, A, IType, Nothing, Nothing} where {uType, tType, P, A, IType}


julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores

@utkarsh530 (Member)

This could be a case where copying the GPU arrays back to the CPU to build the `EnsembleSolution` is more expensive than the GPU solve itself. Try the lower-level API: https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/
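A rough sketch of what that looks like, assuming the `DiffEqGPU.vectorized_solve` entry point from the linked tutorial (exact names and keyword arguments may differ between DiffEqGPU versions; this needs a CUDA-capable GPU to run):

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# Same Lorenz setup as above (lorenz, u0, tspan, p, prob) is assumed in scope.

# Build the per-trajectory problems on the CPU, then move them to the GPU
# yourself, instead of letting EnsembleGPUKernel do it per solve.
probs = map(1:100_000) do i
    remake(prob, p = (@SVector rand(Float32, 3)) .* p)
end
gpu_probs = cu(probs)

# vectorized_solve returns raw GPU arrays of times and states and skips
# constructing an EnsembleSolution on the CPU entirely, so the timing
# reflects only the kernel work plus the two output allocations.
CUDA.@time ts, us = DiffEqGPU.vectorized_solve(gpu_probs, prob, GPUTsit5();
                                               save_everystep = true,
                                               adaptive = false, dt = 0.1f0)
```

If the lower-level call is fast while the `solve(...)` path is not, that would confirm the device-to-host copy and `EnsembleSolution` construction as the bottleneck.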
