ROCm 5.4.0. Use the Docker image rocm/pytorch:rocm5.4_ubuntu20.04_py3.7_pytorch_1.12.1
MI250X
cd ck
We change the number of timing repeats for each instance during tuning to 100 (see build_ck.sh, line 8).
./build_ck.sh
Because CK doesn't provide a profiler for gemm_fast_gelu, we use gemm_add_add_fast_gelu to profile instead.
./test_gemm.sh
./test_gemm_fast_gelu.sh
All results are written to the log files under ./composable_kernel/build/bin/
cd ort
./build_ort.sh
./test_gemm.sh
./test_gemm_fast_gelu.sh
The default initialization is decimal initialization. To change the initialization method, modify the relevant test file. See test_gemm.sh and test_gemm_fast_gelu.sh for details.
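
The three initialization methods named in the results tables (zero, integer, decimal) can be pictured with the minimal sketch below. It is an illustration only: the function name `make_matrix` is hypothetical, and the value ranges (integers in [-5, 5], decimals in [0, 1)) are assumptions, since the exact ranges used by each tool are not stated in these notes.

```python
import random


def make_matrix(rows, cols, method, seed=0):
    """Build a rows x cols matrix using one of three fill styles.

    Hypothetical helper for illustration; the ranges below are assumed,
    not taken from ORT, CK, or hipBLASLt.
    """
    rng = random.Random(seed)
    if method == "zero":
        # Every element is exactly 0.0.
        return [[0.0] * cols for _ in range(rows)]
    if method == "integer":
        # Whole-number values stored as floats (assumed range [-5, 5]).
        return [[float(rng.randint(-5, 5)) for _ in range(cols)]
                for _ in range(rows)]
    if method == "decimal":
        # Fractional values (assumed range [0.0, 1.0)).
        return [[rng.random() for _ in range(cols)] for _ in range(rows)]
    raise ValueError(f"unknown init method: {method}")


if __name__ == "__main__":
    for method in ("zero", "integer", "decimal"):
        print(method, make_matrix(2, 3, method))
```

The real test scripts select the method through their own arguments; check test_gemm.sh and test_gemm_fast_gelu.sh for the actual option names.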
cd hipblaslt
./build_hipblaslt.sh
./test_gemm.sh
./test_gemm_gelu.sh
float16 M=49152 N=3072 K=768 notrans notrans
For GEMM, we record the performance of instance DeviceGemmXdl<256, 128, 128, 4, 8, 32, 32, 2, 2> NumPrefetch: 1, LoopScheduler: Interwave, PipelineVersion: v1, which is the best instance selected by CK.
init method | ORT(ms) | CK(ms) | hipBLASLt(ms) |
---|---|---|---|
zero | 1.582 | 1.5806 | |
integer | 1.651 | 1.6828 | 1.6595 |
decimal | 2.071 | 1.81792 | 1.7734 |
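
To put the times above in throughput terms, the sketch below converts each entry to TFLOPS for the M=49152, N=3072, K=768 problem. It uses the usual convention of counting a GEMM as 2*M*N*K floating-point operations and assumes the hipBLASLt column is also in milliseconds; the helper names are hypothetical.

```python
# Problem shape from this README: float16, M=49152, N=3072, K=768.
M, N, K = 49152, 3072, 768


def gemm_flops(m, n, k):
    """FLOPs for C = A @ B, counting each multiply-add as 2 operations."""
    return 2 * m * n * k


def ms_to_tflops(time_ms, m=M, n=N, k=K):
    """Convert a kernel time in milliseconds to TFLOPS for an m x n x k GEMM."""
    return gemm_flops(m, n, k) / (time_ms * 1e-3) / 1e12


# Times (ms) copied from the table above; None marks an empty cell.
# Assumption: the hipBLASLt column uses the same unit as the others.
gemm_ms = {
    "zero": {"ORT": 1.582, "CK": 1.5806, "hipBLASLt": None},
    "integer": {"ORT": 1.651, "CK": 1.6828, "hipBLASLt": 1.6595},
    "decimal": {"ORT": 2.071, "CK": 1.81792, "hipBLASLt": 1.7734},
}

if __name__ == "__main__":
    for init, row in gemm_ms.items():
        for backend, t in row.items():
            if t is not None:
                print(f"{init:8s} {backend:10s} {ms_to_tflops(t):7.1f} TFLOPS")
```

Because the shape is fixed, a lower time maps directly to a higher TFLOPS figure, so the ranking of backends is the same in either unit.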
For GEMM + FastGelu, we record the performance of instance DeviceGemmMultipleD_Xdl_CShuffle<256, 128, 128, 32, 8, 8, Default> LoopScheduler: Interwave, PipelineVersion: v1, which is the best instance selected by CK.
init method | ORT(ms) | CK(ms) | hipBLASLt(ms) |
---|---|---|---|
zero | 1.885 | 1.871 | |
integer | 1.956 | 2.001 | 1.7829 |
decimal | 2.422 | 2.184 | 1.8188 |
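
One way to read the table above is to normalize each row by its fastest entry, so that 1.0 marks the best backend for that initialization method. The sketch below does this with the numbers copied from the table; the function name `relative_to_fastest` is hypothetical, and empty cells are skipped rather than guessed.

```python
# Times (ms) copied from the table above; None marks an empty cell.
# Assumption: the hipBLASLt column uses the same unit as the others.
fused_ms = {
    "zero": {"ORT": 1.885, "CK": 1.871, "hipBLASLt": None},
    "integer": {"ORT": 1.956, "CK": 2.001, "hipBLASLt": 1.7829},
    "decimal": {"ORT": 2.422, "CK": 2.184, "hipBLASLt": 1.8188},
}


def relative_to_fastest(row):
    """Return each measured time divided by the smallest time in the row."""
    timed = {name: t for name, t in row.items() if t is not None}
    best = min(timed.values())
    return {name: t / best for name, t in timed.items()}


if __name__ == "__main__":
    for init, row in fused_ms.items():
        ratios = relative_to_fastest(row)
        summary = ", ".join(f"{name}: {r:.3f}x" for name, r in ratios.items())
        print(f"{init:8s} {summary}")
```

A ratio of 1.10x, for example, means that backend took 10% longer than the fastest one in the same row.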