This repo describes how to reach roughly 95% of cuBLAS performance for half-precision matrix multiplication in three simple steps. The measured throughput of the three kernels and the cuBLAS reference is shown below:
The kernels were obtained by reverse-engineering the SASS of the cuBLAS kernels. The process is described in greater detail here.
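As a sketch of how such an inspection starts, the SASS can be dumped with `cuobjdump`, which ships with the CUDA toolkit (the library path below is an assumption and varies by install):

```
# SASS of the kernels compiled in this repo
cuobjdump -sass a.out
# SASS of the cuBLAS kernels themselves
cuobjdump -sass /usr/local/cuda/lib64/libcublas.so
```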
The code can be compiled with `nvcc src/matrix.cu -Iinclude -lcublas -arch=sm_86`. Running the three kernels and the cuBLAS reference can be done with `./a.out -1`, which produces:
```
root@c1cdc7e26eb8:/workspace/matrix# ./a.out -1
runs with all ids
Kernel: CublasRefernece
Average elapsed time: (1.462106) ms, performance: (94000.7) GFLOPS. size: (4096).
Kernel: BasicGEMM
Average elapsed time: (2.308537) ms, performance: (59535.1) GFLOPS. size: (4096).
Kernel: Buffering
Average elapsed time: (1.891883) ms, performance: (72646.6) GFLOPS. size: (4096).
Kernel: DoubleBuffering
Average elapsed time: (1.500277) ms, performance: (91609.0) GFLOPS. size: (4096).
```
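The jump from Buffering to DoubleBuffering comes from overlapping the global-memory loads for the next tile with the computation on the current one. Below is a minimal sketch of that idea, not the repo's actual kernel: it uses plain `float` rather than the half precision and tensor cores the real kernels rely on, and the names (`dbufGemm`, `TILE`) are illustrative.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Double-buffered tiled GEMM sketch: C = A * B for square N x N matrices,
// N assumed to be a multiple of TILE. Two shared-memory buffers alternate,
// so the loads for tile t+1 are issued while tile t is being consumed.
// Launch with dim3 grid(N/TILE, N/TILE), block(TILE, TILE).
__global__ void dbufGemm(const float *A, const float *B, float *C, int N) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Preload the first tile into buffer 0.
    int buf = 0;
    As[buf][threadIdx.y][threadIdx.x] = A[row * N + threadIdx.x];
    Bs[buf][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int t = 0; t < N / TILE; ++t) {
        // Issue the loads for the next tile into the other buffer; they
        // do not depend on the math below, so they can overlap with it.
        if (t + 1 < N / TILE) {
            int k = (t + 1) * TILE;
            As[buf ^ 1][threadIdx.y][threadIdx.x] = A[row * N + k + threadIdx.x];
            Bs[buf ^ 1][threadIdx.y][threadIdx.x] = B[(k + threadIdx.y) * N + col];
        }
        // Multiply-accumulate on the current buffer.
        for (int k = 0; k < TILE; ++k)
            acc += As[buf][threadIdx.y][k] * Bs[buf][k][threadIdx.x];
        // One barrier per iteration: makes the next buffer's loads visible
        // and guarantees the current buffer is done being read before it
        // gets overwritten two iterations from now.
        __syncthreads();
        buf ^= 1;
    }
    C[row * N + col] = acc;
}
```

On Ampere this pattern is typically driven with `cp.async` so the prefetch bypasses registers entirely; the sketch only shows the buffering structure, not that mechanism.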
The code targets the Ampere architecture. The Hopper architecture introduced new instructions and execution patterns (most notably producer and consumer warps), which this code does not use.
There are several excellent repos from which I learned a lot, most notably: