You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Vectorize load instructions in dmmv f16 CUDA kernel (ggml-org#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel
Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
Co-authored-by: Johannes Gäßler <[email protected]>
---------
Co-authored-by: Johannes Gäßler <[email protected]>
0 commit comments