Hi, I know that quantisation will introduce some performance regression. But does ExLlama introduce any additional regression on top of that? Or, in theory, should it preserve the same quality as any other inference engine running those quantised weights? In other words, should I get the same logits whether I run inference with ExLlama or with another quantised-inference library? I'm assuming it is lossless in that sense, but I just wanted to double check.
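For context, the kind of check I have in mind is something like the sketch below. It's not from either library; `load_exllama_model` and `load_gptq_model` are placeholders for however each backend is actually loaded, and only the comparison logic is real:

```python
import torch

# Hypothetical loaders -- stand-ins for whatever loading code each backend uses.
# exllama_model = load_exllama_model("path/to/quantized-model")
# other_model = load_gptq_model("path/to/quantized-model")

def compare_logits(logits_a: torch.Tensor, logits_b: torch.Tensor):
    """Compare next-token logits from two backends on the same prompt."""
    # Largest absolute difference anywhere in the vocabulary distribution.
    max_abs_diff = (logits_a.float() - logits_b.float()).abs().max().item()
    # Whether greedy decoding would still pick the same next token.
    same_argmax = torch.equal(logits_a.argmax(dim=-1), logits_b.argmax(dim=-1))
    return max_abs_diff, same_argmax
```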
Replies: 1 comment
ExLlama isn't doing approximate attention or anything like that, but it is using FP16 math in some places where other implementations use FP32.

I've managed to create a comparative benchmark against GPTQ-for-LLaMA that shows small but measurable differences in perplexity. I still have to validate that there isn't an off-by-one error or something similar skewing the results, and crucially the differences get (a lot) smaller the larger the model gets. Sometimes they even come out in ExLlama's favor, so it's not that the FP16 math strictly hurts perplexity.

It's also not clear whether it's a relevant comparison to make in the first place, since all of the logits contribute when calculating perplexity, but most of them (the noise on the tail end of the probability distribution) are completely ignored during generation. I need to do a little more work on that side of things. But the short answer is no, it's not entirely lossless in the way you describe. It's more similar to evaluating a GPTQ-patched HF model after …
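To illustrate that last point, here's a minimal sketch (generic PyTorch, not ExLlama's actual evaluation code) of why two backends can differ slightly in perplexity yet still produce identical greedy generations:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Perplexity over a sequence: every logit contributes through the
    log-softmax normalization, including the low-probability tail."""
    # logits: [seq_len, vocab_size], target_ids: [seq_len]
    nll = F.cross_entropy(logits, target_ids)
    return torch.exp(nll).item()

def greedy_next_token(logits: torch.Tensor) -> int:
    """Greedy generation only cares about which logit is largest; small FP16
    differences in the tail of the distribution usually don't change this."""
    return int(logits[-1].argmax())
```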