llamafile_sgemm API - INT8 implementation
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, which use MMA
builtins for the quantised int8 data type.
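
For orientation, the arithmetic that an int8 matrix multiplication kernel has to reproduce can be written as a plain scalar loop. The sketch below is not code from this patch and does not use MMA builtins. The function name `gemm_i8_reference` and the row-major layout are assumptions made for illustration, and the per-block scale factors that quantised formats usually carry are left out. It only shows int8 inputs being widened and accumulated in int32, which is the kind of result an optimised kernel could be checked against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the patch):
// computes C = A * B with int8 inputs and int32 accumulation.
// A is m x k, B is k x n, C is m x n; all are stored row-major.
std::vector<int32_t> gemm_i8_reference(const std::vector<int8_t>& a,
                                       const std::vector<int8_t>& b,
                                       std::size_t m, std::size_t n,
                                       std::size_t k) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);

    std::vector<int32_t> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (std::size_t p = 0; p < k; ++p) {
                // Widen before multiplying so the product cannot
                // overflow the 8-bit input type.
                acc += static_cast<int32_t>(a[i * k + p]) *
                       static_cast<int32_t>(b[p * n + j]);
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

An optimised kernel would be expected to produce the same integer sums tile by tile, so a loop like this is mainly useful as a correctness baseline rather than as a performance reference.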

This change results in a 10% - 70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
amritahs-ibm committed Dec 20, 2024
1 parent d408bb9 commit d70f5fc
Showing 1 changed file with 643 additions and 69 deletions.