Hi, I know that quantisation will introduce some performance regression. But does ExLlama introduce any additional regression on top of that? Or, in theory, should it preserve the same quality as any other inference engine running those quantised weights? In other words, should I get the same logits whether I run inference with ExLlama or with another quantised-inference library? I'm assuming it is lossless in that sense, but I just wanted to double check.
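For context, the kind of check I have in mind is something like the sketch below. It's not from either library; `load_exllama_model` and `load_gptq_model` are placeholders for however each backend is actually loaded, and only the comparison logic is real:

```python
import torch

# Hypothetical loaders -- stand-ins for whatever loading code each backend uses.
# exllama_model = load_exllama_model("path/to/quantized-model")
# other_model = load_gptq_model("path/to/quantized-model")

def compare_logits(logits_a: torch.Tensor, logits_b: torch.Tensor):
    """Compare next-token logits from two backends on the same prompt."""
    # Largest absolute difference anywhere in the vocabulary distribution.
    max_abs_diff = (logits_a.float() - logits_b.float()).abs().max().item()
    # Whether greedy decoding would still pick the same next token.
    same_argmax = torch.equal(logits_a.argmax(dim=-1), logits_b.argmax(dim=-1))
    return max_abs_diff, same_argmax
```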
Replies: 1 comment
ExLlama isn't doing approximate attention or anything like that, but it is using FP16 math in some places where other implementations use FP32.

I've managed to create a comparative benchmark against GPTQ-for-LLaMA that shows small but measurable differences in perplexity. I still have to validate that there isn't an off-by-one error or something similar skewing the results, and crucially the differences get (a lot) smaller the larger the model gets. Sometimes they even come out in ExLlama's favor, so it's not that the FP16 math strictly hurts perplexity.

It's also not clear whether it's a relevant comparison to make in the first place, since all of the logits contribute when calculating perplexity, but most of them (the noise on the tail end of the probability distribution) are completely ignored during generation. I need to do a little more work on that side of things. But the short answer is no, it's not entirely lossless in the way you describe. It's more similar to evaluating a GPTQ-patched HF model after …
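To illustrate that last point, here's a minimal sketch (generic PyTorch, not ExLlama's actual evaluation code) of why two backends can differ slightly in perplexity yet still produce identical greedy generations:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Perplexity over a sequence: every logit contributes through the
    log-softmax normalization, including the low-probability tail."""
    # logits: [seq_len, vocab_size], target_ids: [seq_len]
    nll = F.cross_entropy(logits, target_ids)
    return torch.exp(nll).item()

def greedy_next_token(logits: torch.Tensor) -> int:
    """Greedy generation only cares about which logit is largest; small FP16
    differences in the tail of the distribution usually don't change this."""
    return int(logits[-1].argmax())
```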