Does this actively use TensorCores on Volta+ architectures, or are you getting these speedups on CUDA cores alone?
For long sequences (i.e. prompts) it dequantizes matrices and uses cuBLAS for the matmul, and cuBLAS will no doubt use tensor cores when that's optimal. For token-by-token generation tensor cores don't make sense, though, since the hidden state ends up being a one-row vector, so you'd only be using the top row of each fragment.

And INT4 is only supported in tensor cores (not to be confused with int4, which is a 4*32-bit type you can use to help with memory coalescing, as opposed to packing four ints into a structure). It also wouldn't really be the right pick here, because GPTQ values are quantized with scale and zero parameters, so doing any math on the bare 4-bit values would give incorrect results, or at best some bad rounding errors. All the math is done in FP16.

The main reason quantized models run inference faster, once you get past all the other potential bottlenecks, is that they require much less VRAM access. If you've got e.g. a 33B Llama model with FP16 weights, one pass through the model has to read 66 GB of data from VRAM. So no matter what, you're going to have an upper limit on tokens/second determined by the memory bandwidth of the GPU. The A100, for example, with a bandwidth of 1,555 GB/s, can't exceed 24 tokens/second if each token requires 66 GB of VRAM access. With quantized weights that come in at 17 GB instead, the theoretical limit is 91 tokens/second.

Of course that's just the hard upper limit, and the actual speed will always be lower because there are activations to read and write as well, including attention, but the main thing is that you have a lot less data to read. So all the fancy features like tensor cores and what have you end up being irrelevant. All that matters is that you get to the next VRAM access before the last one finishes.
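To make those two points concrete, here's a rough back-of-the-envelope sketch in Python. The `dequantize` and `max_tokens_per_second` helpers and the `(q - zero) * scale` formula are just illustrations of the general scale/zero idea and the bandwidth bound, not ExLlama's or GPTQ's actual kernel layout; the bandwidth and model-size numbers are the A100 figures from the reply above.

```python
import numpy as np

# Why raw 4-bit integer math would be wrong: GPTQ stores groups of 4-bit codes
# plus a scale and zero point, and the real FP16 weight is only recovered after
# dequantization. (Illustrative formula, not the exact GPTQ storage layout.)
def dequantize(q, scale, zero):
    return (q.astype(np.float16) - zero) * scale

# Why quantization helps speed: generation is memory-bandwidth bound, so
# tokens/second is capped by bandwidth divided by bytes read per token.
def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    return bandwidth_gb_s / weights_gb

print(max_tokens_per_second(1555, 66))  # 33B FP16 on an A100 -> ~24 tokens/s
print(max_tokens_per_second(1555, 17))  # 4-bit quantized     -> ~91 tokens/s
```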