Yep, that was an experimental feature. It's not useful, as it turns out, because inference ends up being bandwidth-limited for precisely those reasons mentioned in the paper. I'm just keeping the feature for now for testing purposes.
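A rough way to see why bandwidth dominates: at batch size 1, every generated token has to stream essentially all of the weights once, so the decode ceiling is roughly weight bytes divided by memory bandwidth. A minimal back-of-envelope sketch (the parameter count and bandwidth figure are assumed, illustrative numbers, not measurements):

```python
# Back-of-envelope: single-token decode is a matrix-vector pass over all
# weights, so per-token latency is roughly weight_bytes / memory_bandwidth.
# PARAMS and BANDWIDTH are assumed, illustrative values.

PARAMS = 7e9          # assumed 7B-parameter model
BANDWIDTH = 1.0e12    # assumed ~1 TB/s GPU memory bandwidth

def tokens_per_second(bits_per_weight: float) -> float:
    """Estimate the bandwidth-bound decode ceiling for a given weight width."""
    weight_bytes = PARAMS * bits_per_weight / 8
    return BANDWIDTH / weight_bytes

print(f"fp16 weights : ~{tokens_per_second(16):.0f} tok/s ceiling")
print(f"4-bit weights: ~{tokens_per_second(4):.0f} tok/s ceiling")
# The 4-bit path moves ~4x fewer bytes per token, so even a slower
# dequantize-on-the-fly kernel can beat a plain fp16 matmul.
```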
-
2023-05-22: Added an option to dequantize layers at load time, which should speed up inference, but it turns out Torch's fp16 matmul is actually slower than the quantized matmul. Maybe bandwidth is the only bottleneck right now? Need to experiment some more.
I'm not actually sure this is relevant, but I remember the GPTQ paper saying that quantized inference ends up faster precisely because generation is limited by memory bandwidth.
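One quick way to check whether the fp16 path really is bandwidth-bound would be to time a large fp16 matvec and compare the implied GB/s against the card's peak memory bandwidth. A minimal sketch along those lines (matrix size and iteration count are arbitrary; it assumes a CUDA device and plain PyTorch, nothing project-specific):

```python
# Minimal check that a large fp16 matvec on GPU is memory-bound: compare the
# bandwidth implied by the measured time against the card's spec sheet.
import torch

n = 8192
w = torch.randn(n, n, device="cuda", dtype=torch.float16)
x = torch.randn(n, device="cuda", dtype=torch.float16)

# warm-up so cuBLAS kernel selection doesn't pollute the timing
for _ in range(10):
    y = w @ x
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    y = w @ x
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end) / iters
gbytes = w.numel() * w.element_size() / 1e9   # weight bytes streamed per matvec
print(f"{ms:.3f} ms per matvec, ~{gbytes / (ms / 1e3):.0f} GB/s effective")
# If the effective figure sits near the GPU's peak memory bandwidth, the op is
# limited by weight traffic rather than by fp16 FLOPs.
```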