The goal of this part is to implement the 3D diffusion equation:
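$$\frac{\partial H}{\partial t} = D\,\nabla^2 H = D\left(\frac{\partial^2 H}{\partial x^2} + \frac{\partial^2 H}{\partial y^2} + \frac{\partial^2 H}{\partial z^2}\right)$$

(assuming a constant diffusion coefficient `D`, as in the kernels shown later)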
using the dual-time method, where the physical time derivative (with physical time step dt) enters as a physical term and the pseudo-time (τ) is used to iterate towards the solution:
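$$\frac{\partial H}{\partial \tau} = -\frac{H - H^{t}}{dt} + D\,\nabla^2 H$$

Here `H^t` denotes the solution at the previous physical time step, consistent with the residual `ResH` computed in the code below.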
We are also interested in the steady-state solution, where the physical time derivative term is simply dropped:
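$$\frac{\partial H}{\partial \tau} = D\,\nabla^2 H$$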
In both cases we also make use of acceleration/damping to improve the convergence of the pseudo-transient iterations.
The next goal is to implement a version of this PDE solver that can run on multiple CPUs and/or GPUs, on a single computer or even a distributed system. For this task we make use of ParallelStencil.jl, which enables us to write code that can be deployed on either a CPU or a GPU (XPU for short). We also use ImplicitGlobalGrid.jl for distributed parallelization (using MPI) of the XPU solution.
Using the different implementations we then perform performance tests and scaling experiments.
As previously stated, the goal is to solve the linear diffusion equation in three dimensions using the dual-time method, where the physical time derivative (with physical time step dt) enters as a physical term and the pseudo-time (τ) is used to iterate towards the solution:
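$$\frac{\partial H}{\partial \tau} = -\frac{H - H^{t}}{dt} + D\,\nabla^2 H$$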
Which implies the following (pseudo-time) discretisation:
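$$\frac{H^{\tau+d\tau} - H^{\tau}}{d\tau} = -\frac{H^{\tau} - H^{t}}{dt} + D\,\nabla^2 H^{\tau}$$

(before adding the damping term introduced in the steps below)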
For practical purposes we divide the solution into three steps (the corresponding update formulas are sketched after this list):

- First we calculate the residual `ResH`. In the case where we want the steady-state solution, the physical time derivative term is simply dropped from `ResH`.
- Since we use damping, we update `dH/dτ` in the second step with a damping parameter.
- Finally we update `H`.
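A sketch of the three steps, consistent with the kernels shown further below (`H^τ` denotes the current pseudo-time iterate, `H^t` the solution at the previous physical time step):

$$\mathrm{ResH} = -\frac{H^{\tau} - H^{t}}{dt} + D\,\nabla^2 H^{\tau} \quad \text{(dual time)}, \qquad \mathrm{ResH} = D\,\nabla^2 H^{\tau} \quad \text{(steady state)}$$

$$\frac{dH}{d\tau} \leftarrow \mathrm{ResH} + \mathrm{damp}\cdot\frac{dH}{d\tau}$$

$$H \leftarrow H + d\tau\cdot\frac{dH}{d\tau}$$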
Sidenote on Numerical Differentiation
We used the `FiniteDifferences3D` submodule from ParallelStencil.jl for the numerical differentiation. As the name suggests, this submodule provides numerical differentiation via the finite differences method.
Here is an example of a partial numerical derivative of `H` in direction `x`:
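$$\frac{\partial H}{\partial x} \approx \frac{H_{i_x+1,\,i_y,\,i_z} - H_{i_x,\,i_y,\,i_z}}{dx}$$

(a forward difference; $i_x$, $i_y$, $i_z$ denote grid indices and $dx$ the grid spacing)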
Equivalently, here is a partial numerical 2nd derivative of `H` in direction `x`:
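$$\frac{\partial^2 H}{\partial x^2} \approx \frac{\left(H_{i_x+2,\,i_y,\,i_z} - H_{i_x+1,\,i_y,\,i_z}\right) - \left(H_{i_x+1,\,i_y,\,i_z} - H_{i_x,\,i_y,\,i_z}\right)}{dx^2}$$

(the difference of two neighbouring first differences, i.e. three differences in total, divided by $dx^2$)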
Using this submodule provides us with the following abstraction:
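A sketch of how the macros used in our kernels map to the stencils above (index conventions as in ParallelStencil's `FiniteDifferences3D`; `ix`, `iy`, `iz` are grid indices):

- `@d2_xi(H)` ≈ `(H[ix+2,iy+1,iz+1] - H[ix+1,iy+1,iz+1]) - (H[ix+1,iy+1,iz+1] - H[ix,iy+1,iz+1])`, the 2nd difference in `x`, evaluated on the inner points in `y` and `z` (analogously `@d2_yi` and `@d2_zi`)
- `@inn(H)` ≈ `H[ix+1,iy+1,iz+1]`, the inner points of `H`
- `@all(H)` ≈ `H[ix,iy,iz]`, all points of `H`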
And thus the 2nd derivative of `H` in direction `x` can be written as:
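`@d2_xi(H) / dx^2` (evaluated on the inner points in `y` and `z`).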
To optimize, we replace all divisions with multiplications by precomputed inverse factors, since multiplication is much faster than division. We also rearrange the equation to use as few operations as possible and precompute values where possible.
So let's look again at `ResH`:
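$$\mathrm{ResH} = -\frac{H^{\tau} - H^{t}}{dt} + D\,\nabla^2 H^{\tau}$$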
And since we solve this equation in three dimensions, we get this equivalent formula:
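$$\mathrm{ResH} = -\frac{H^{\tau} - H^{t}}{dt} + D\left(\frac{\partial^2 H}{\partial x^2} + \frac{\partial^2 H}{\partial y^2} + \frac{\partial^2 H}{\partial z^2}\right)$$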
Then we use the abstraction described in the Numerical Differentiation sidenote to discretize the PDE in the spatial domain:
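A sketch of the resulting (not yet optimized) kernel line, using the macros from the sidenote; the optimized version below replaces the divisions with precomputed factors:

```julia
@all(ResH) = -(@inn(H) - @inn(Hold)) / dt +
             D * (@d2_xi(H) / dx^2 + @d2_yi(H) / dy^2 + @d2_zi(H) / dz^2)
```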
And now the optimizations described above are pretty straightforward:
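A sketch of the precomputed factors (the definitions are inferred from the kernel; the names match those used in the code below):

```julia
_dt   = 1.0 / dt    # inverse physical time step
D_dx² = D / dx^2    # diffusion coefficient divided by the squared grid spacing in x
D_dy² = D / dy^2    # ... in y
D_dz² = D / dz^2    # ... in z
```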
Since we can't really optimize the other two steps, we now have the following three calculation steps in each iteration:
# Step 1 (dual time)
@all(ResH) = -(@inn(H) - @inn(Hold)) * _dt + (@d2_xi(H)*D_dx² + @d2_yi(H)*D_dy² + @d2_zi(H)*D_dz²)
# Step 1 (steady state, only pseudo time)
@all(ResH) = @d2_xi(H)*D_dx² + @d2_yi(H)*D_dy² + @d2_zi(H)*D_dz²
# Step 2
@all(dHdt) = @all(ResH) + damp * @all(dHdt)
# Step 3
@inn(H) = @inn(H) + dτ * @all(dHdt)
As the name "dual-time" suggests, we iterate over two kinds of time: the physical time and the pseudo-time.
In this case the physical time step and the total physical time are given by `dt = 0.2` and `ttot = 1.0` (both in seconds).
So we iterate over the physical time, and in each physical time step `t` we iterate over the pseudo-time τ until the L2-norm of the equation's residual (`norm(ResH)/sqrt(length(ResH))`) is smaller than the absolute tolerance given by `tol = 1e-8`. Then we increment `t` by `dt` and update `H^t` with the value of `H^τ`. If `t < ttot` we start the pseudo-time loop again. Here is a code snippet that illustrates the dual time loop:
#(...)
t = 0
# Physical time loop
while t < ttot
    it_τ = 0
    # Pseudo-transient iteration
    while err > tol && it_τ < itMax
        # Calculate ResH, dHdt and H
        #(...)
        # Calculate error
        #(...)
        it_τ += 1
    end
    # update physical time step
    t += dt
    #(...)
end
#(...)
And if we only want the steady-state solution, we just need the pseudo-transient iteration loop:
#(...)
it_τ = 0
# Pseudo-transient iteration
while err > tol && it_τ < itMax
    # Calculate ResH, dHdt and H
    #(...)
    # Calculate error
    #(...)
    it_τ += 1
end
#(...)
The hardware used to perform all simulations and scaling experiments:
- CPU: Intel® Core™ i7-11700 (8C16T, T_peak=50GB/s [1])
- GPU: NVIDIA GeForce RTX 3060 (T_peak=360 GB/s, 199 GFLOPS (FP64) [2])
Except for the Multi GPU Experiment, where we used:
- Multi-GPU Node: 4x NVIDIA GeForce GTX TITAN X (T_peak=480 GB/s, 209 GFLOPS (FP64) [3])
Dual Time Solution with grid size 512x512x256 - 2D slice at z=128 - using 4x NVIDIA GeForce GTX TITAN X
Generated by executing: `julia --project -O3 --check-bounds=no -t 4 ./scripts-part1/diffusion3D_visualize.jl`.
We are using the performance metric proposed in the ParallelStencil.jl library (`T_eff = A_eff / t_it`).
In this case, the `A_eff` metric was calculated as follows:
reads = length(Hᵗ) # Read Only Memory Access: Hᵗ
updates = length(H) + length(dHdt) # Update Memory access: H and dHdt
A_eff = 1e-9 * (2 * updates + reads) * sizeof(Float64) # Effective main memory access per iteration [GB]
We do not count `ResH`, as it is a convenience array and thus could be skipped.
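For example, with a 256³ grid this gives `A_eff = 1e-9 * (2*2 + 1) * 256^3 * sizeof(Float64) ≈ 0.67 GB` of effective memory access per iteration.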
To find out if this problem is compute or memory bound, we have to look at the ratio between compute speed and memory access speed. In this case we are using an NVIDIA GeForce RTX 3060[2]:
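Using the numbers from [2], this ratio works out to roughly

$$\frac{199\ \text{GFLOPS (FP64)}}{360\ \text{GB/s}\ /\ 8\ \text{B per Float64}} \approx \frac{199}{45} \approx 4.4\ \text{floating point operations per memory access.}$$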
So in this case we would be memory bound if we had fewer than about 4 floating point operations for every memory access. But looking again at how `ResH` is calculated:
# Step 1 (dual time)
@all(ResH) = -(@inn(H) - @inn(Hold)) * _dt + (@d2_xi(H)*D_dx² + @d2_yi(H)*D_dy² + @d2_zi(H)*D_dz²)
We see that for the 3 memory accesses we have 18 operations (`@d2_xi()` contains 3 differences, as described in the numerical differentiation sidenote). So we would have a ratio of 18/3 = 6. Thus we are very likely to be compute bound for this problem. This is also true for the weak scaling experiment, where we use the NVIDIA GeForce GTX TITAN X[3], which has a slightly lower ratio of:
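Using the numbers from [3]:

$$\frac{209\ \text{GFLOPS (FP64)}}{480\ \text{GB/s}\ /\ 8\ \text{B per Float64}} \approx \frac{209}{60} \approx 3.5\ \text{floating point operations per memory access.}$$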
An Intel® Core™ i7-11700 (8C16T, T_peak=50GB/s [1]) was used for the CPU performance benchmark.
We executed the `diffusion3D_benchmark_cpu.jl` script 4 times, with 1, 4, 8 and 16 threads respectively.
All 4 plots were generated by running `julia --project -O3 --check-bounds=no -t <num_threads> ./scripts-part1/diffusion3D_benchmark_cpu.jl`, where `<num_threads>` is replaced by the number of threads.
We see an interesting phenomenon with grid sizes 32 and especially 64 when we increase the number of threads: the `T_eff` values are significantly higher compared to those at larger grid sizes. This behavior is probably the result of good CPU caching, since grid sizes of 32 and 64 fit nicely in the cache. The cache size is about 16 MB [1], and we allocate 8 arrays, so a grid size of 64 (`8 * 64^3 * 4 ≈ 8 MB`) fits well in a 16 MB cache, while a grid size of 128 (`8 * 128^3 * 4 ≈ 67 MB`) of course does not. So the true `T_eff` is reflected by grid sizes of 128 and up.
An NVIDIA GeForce RTX 3060 (T_peak=360 GB/s, 199 GFLOPS (FP64) [2]) was used for the GPU performance benchmark.
Running the `diffusion3D_benchmark_gpu.jl` script also shows us the optimal local problem size, which we want to use later for the multi-GPU scaling experiment. In this case, 256 looks like the optimal problem size.
Generated with `julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_benchmark_gpu.jl`.
Now that we have the optimal local problem size for a GPU (256), we run a weak scaling experiment using multiple GPUs (4x NVIDIA GeForce GTX TITAN X[3]). To assess the performance when scaling to multiple GPUs we created two plots:
| Number of GPUs vs Effective memory throughput | Number of GPUs vs Parallel efficiency ([time using n GPUs]/[time using 1 GPU]) |
|---|---|
The plots were generated by executing the `diffusion3D_benchmark_multigpu.jl` script 4 times:
~/.julia/bin/mpiexecjl -n 1 julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_benchmark_multigpu.jl
~/.julia/bin/mpiexecjl -n 2 julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_benchmark_multigpu.jl
~/.julia/bin/mpiexecjl -n 3 julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_benchmark_multigpu.jl
~/.julia/bin/mpiexecjl -n 4 julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_benchmark_multigpu.jl
The NVIDIA GeForce GTX TITAN X[3] and NVIDIA GeForce RTX 3060[2] GPUs are on paper quite similar in terms of FP64 performance (209 vs 199 GFLOPS) and also in their compute-to-memory-bandwidth ratio (both ≈ 4). Since we established earlier that the problem is likely compute bound, we therefore expected about the same or slightly worse performance, which the plots confirm (`T_eff ≈ 130 GB/s` on the RTX 3060 vs `T_eff ≈ 125 GB/s` on the Titan X, at a grid resolution of 256).
The plots below were generated by executing the `diffusion3D_work_precision.jl` script: `julia --project -O3 --check-bounds=no ./scripts-part1/diffusion3D_work_precision.jl`.
Iterations to steady state vs Grid size
For this problem, the number of iterations needed to reach steady state (with a convergence tolerance of `tol = 1e-8`) seems to scale linearly with the cube root of the number of grid points, i.e. with the grid resolution. So increasing the number of grid points is not that costly if we are interested in the steady-state solution.
Value at domain point (5,5,5) vs Grid size
This plot shows that beyond a grid size of 256x256x256, the value at the point (5,5,5) changes very little when increasing the grid size; the change is almost negligible compared to what we see at the lower grid sizes. This means that using a grid resolution of 256 earlier was probably a good choice in this respect.
Tolerance vs Convergence Behaviour to well converged solution
We considered the solution obtained with a tolerance of `tol = 1e-24` as well converged; this is still numerically stable, and decreasing the tolerance further resulted in very long runtimes. So for practical purposes, this plot compares the absolute (L2 norm of the difference) and relative (L2 norm of the difference / L2 norm of the well-converged solution) convergence behavior against that `tol = 1e-24` reference solution. As you can see, decreasing the tolerance by a factor of 10 reduces the relative and absolute error roughly linearly.
[1] https://ark.intel.com/content/www/us/en/ark/products/212279/intel-core-i711700-processor-16m-cache-up-to-4-90-ghz.html
[2] https://www.techpowerup.com/gpu-specs/geforce-rtx-3060.c3682
[3] https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-x.c2632