llamafile_sgemm API - INT8 implementation #10912

amritahs-ibm · 2024-12-20T05:33:43Z

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
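For readers unfamiliar with quantised int8 matrix multiplication, here is a minimal scalar reference sketch of what such a kernel computes. It is not the code from this patch: the `BlockQ8` layout, the block size of 32, and the names `quantize_block`, `dot_q8` and `gemm_q8_ref` are assumptions made for illustration, and the actual kernels use POWER10 MMA builtins rather than plain loops.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical block layout for illustration only:
// 32 int8 values that share one float scale.
constexpr int kBlock = 32;

struct BlockQ8 {
    float  scale;      // dequantisation factor for this block
    int8_t q[kBlock];  // quantised values in [-127, 127]
};

// Quantise one block of kBlock floats with a symmetric per-block scale.
BlockQ8 quantize_block(const float *x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ8 b;
    b.scale = amax / 127.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlock; ++i) b.q[i] = (int8_t) std::lround(x[i] * inv);
    return b;
}

// Dot product of two quantised rows: accumulate in int32 per block,
// then apply both block scales in floating point.
float dot_q8(const BlockQ8 *a, const BlockQ8 *b, int nblocks) {
    float sum = 0.0f;
    for (int k = 0; k < nblocks; ++k) {
        int32_t acc = 0;
        for (int i = 0; i < kBlock; ++i) {
            acc += (int32_t) a[k].q[i] * (int32_t) b[k].q[i];
        }
        sum += a[k].scale * b[k].scale * (float) acc;
    }
    return sum;
}

// C[i * n + j] = dot(row i of A, row j of B).
// A holds m rows and B holds n rows, each row made of nblocks blocks.
void gemm_q8_ref(const BlockQ8 *A, const BlockQ8 *B, float *C,
                 int m, int n, int nblocks) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            C[i * n + j] = dot_q8(A + i * nblocks, B + j * nblocks, nblocks);
        }
    }
}
```

An optimised kernel would typically replace the inner integer accumulation with hardware instructions that process a small tile of outputs at once, while the per-block scaling stays in floating point.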

Signed-off-by: Amrita H S <[email protected]>

amritahs-ibm · 2024-12-20T07:11:30Z

Hi @ggerganov,
Could you please help review this PR, or let me know if any action is still required from me to get this patch reviewed?

github-actions bot added the "testing" (everything test related) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Dec 20, 2024

amritahs-ibm force-pushed the sgemm_q8 branch from 3c60b12 to 85c5280 on December 20, 2024 05:35

amritahs-ibm force-pushed the sgemm_q8 branch from 85c5280 to d70f5fc on December 20, 2024 06:22
