Yep, that was an experimental feature. It's not useful, as it turns out, because inference ends up being bandwidth-limited for precisely those reasons mentioned in the paper. I'm just keeping the feature for now for testing purposes.
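A rough way to see why bandwidth dominates: at batch size 1, every generated token has to stream essentially all of the weights once, so the decode ceiling is roughly weight bytes divided by memory bandwidth. A minimal back-of-envelope sketch (the parameter count and bandwidth figure are assumed, illustrative numbers, not measurements):

```python
# Back-of-envelope: single-token decode is a matrix-vector pass over all
# weights, so per-token latency is roughly weight_bytes / memory_bandwidth.
# PARAMS and BANDWIDTH are assumed, illustrative values.

PARAMS = 7e9          # assumed 7B-parameter model
BANDWIDTH = 1.0e12    # assumed ~1 TB/s GPU memory bandwidth

def tokens_per_second(bits_per_weight: float) -> float:
    """Estimate the bandwidth-bound decode ceiling for a given weight width."""
    weight_bytes = PARAMS * bits_per_weight / 8
    return BANDWIDTH / weight_bytes

print(f"fp16 weights : ~{tokens_per_second(16):.0f} tok/s ceiling")
print(f"4-bit weights: ~{tokens_per_second(4):.0f} tok/s ceiling")
# The 4-bit path moves ~4x fewer bytes per token, so even a slower
# dequantize-on-the-fly kernel can beat a plain fp16 matmul.
```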
-
2023-05-22: Added an option to dequantize layers at load time, which should speed up inference, but it turns out Torch's fp16 matmul is actually slower than the quantized matmul. Maybe bandwidth is the only bottleneck right now? Need to experiment some more.
I'm not actually sure this is relevant, but I remember the GPTQ paper saying that quantized inference ends up faster precisely because generation is limited by memory bandwidth.
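One quick way to check whether the fp16 path really is bandwidth-bound would be to time a large fp16 matvec and compare the implied GB/s against the card's peak memory bandwidth. A minimal sketch along those lines (matrix size and iteration count are arbitrary; it assumes a CUDA device and plain PyTorch, nothing project-specific):

```python
# Minimal check that a large fp16 matvec on GPU is memory-bound: compare the
# bandwidth implied by the measured time against the card's spec sheet.
import torch

n = 8192
w = torch.randn(n, n, device="cuda", dtype=torch.float16)
x = torch.randn(n, device="cuda", dtype=torch.float16)

# warm-up so cuBLAS kernel selection doesn't pollute the timing
for _ in range(10):
    y = w @ x
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    y = w @ x
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end) / iters
gbytes = w.numel() * w.element_size() / 1e9   # weight bytes streamed per matvec
print(f"{ms:.3f} ms per matvec, ~{gbytes / (ms / 1e3):.0f} GB/s effective")
# If the effective figure sits near the GPU's peak memory bandwidth, the op is
# limited by weight traffic rather than by fp16 FLOPs.
```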