Does this actively use TensorCores on Volta+ architectures, or are you getting these speedups on CUDA cores alone?
For long sequences (i.e. prompts) it dequantizes matrices and uses cuBLAS for the matmul, and cuBLAS will no doubt use tensor cores when that's optimal. For token-by-token generation tensor cores don't make sense, though, since the hidden state ends up being a one-row vector, so you'd only be using the top row of each fragment.

And INT4 is only supported in tensor cores (not to be confused with int4, which is a 4*32-bit type you can use to help with memory coalescing, as opposed to packing four ints into a structure). It also wouldn't really be the right pick here, because GPTQ values are quantized with scale and zero parameters, so doing any math on the bare 4-bit values would give incorrect results, or at best some bad rounding errors. All the math is done in FP16.

The main reason quantized models run inference faster, once you get past all the other potential bottlenecks, is that they require much less VRAM access. If you've got e.g. a 33B Llama model with FP16 weights, one pass through the model has to read 66 GB of data from VRAM. So no matter what, you're going to have an upper limit on tokens/second determined by the memory bandwidth of the GPU. The A100, for example, with a bandwidth of 1,555 GB/s, can't exceed 24 tokens/second if each token requires 66 GB of VRAM access. With quantized weights that come in at 17 GB instead, the theoretical limit is 91 tokens/second.

Of course that's just the hard upper limit, and the actual speed will always be lower because there are activations to read and write as well, including attention, but the main thing is that you have a lot less data to read. So all the fancy features like tensor cores and what have you end up being irrelevant. All that matters is that you get to the next VRAM access before the last one finishes.
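To make those two points concrete, here's a rough back-of-the-envelope sketch in Python. The `dequantize` and `max_tokens_per_second` helpers and the `(q - zero) * scale` formula are just illustrations of the general scale/zero idea and the bandwidth bound, not ExLlama's or GPTQ's actual kernel layout; the bandwidth and model-size numbers are the A100 figures from the reply above.

```python
import numpy as np

# Why raw 4-bit integer math would be wrong: GPTQ stores groups of 4-bit codes
# plus a scale and zero point, and the real FP16 weight is only recovered after
# dequantization. (Illustrative formula, not the exact GPTQ storage layout.)
def dequantize(q, scale, zero):
    return (q.astype(np.float16) - zero) * scale

# Why quantization helps speed: generation is memory-bandwidth bound, so
# tokens/second is capped by bandwidth divided by bytes read per token.
def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    return bandwidth_gb_s / weights_gb

print(max_tokens_per_second(1555, 66))  # 33B FP16 on an A100 -> ~24 tokens/s
print(max_tokens_per_second(1555, 17))  # 4-bit quantized     -> ~91 tokens/s
```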