Commit

llamafile_sgemm API - INT8 implementation
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 data type.

This change results in a 10% - 70% improvement
in total speed (i.e. all tokens / total time)
across various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
amritahs-ibm committed Dec 20, 2024
1 parent d408bb9 commit 85c5280
Showing 2 changed files with 649 additions and 69 deletions.