
Tensor Usage and Data Types #34

Answered by turboderp
TNT3530 asked this question in Q&A

For long sequences (i.e. prompts) it dequantizes matrices and uses cuBLAS for matmul, and cuBLAS will no doubt use tensor cores when it's optimal. For token-by-token generation tensor cores don't make sense, though, since the hidden state ends up being a one-row vector. So you'd only be using the top row of each fragment.
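The shape difference described above can be sketched in a few lines of NumPy (an illustration of the shapes only, not exllama's actual CUDA code; the dimensions are hypothetical):

```python
import numpy as np

hidden = 4096    # hypothetical hidden-state width
seq_len = 512    # hypothetical prompt length

W = np.random.randn(hidden, hidden).astype(np.float16)

# Prompt processing: the hidden state is (seq_len x hidden), so the
# matmul is large and a cuBLAS/tensor-core path pays off.
prompt_state = np.random.randn(seq_len, hidden).astype(np.float16)
prompt_out = prompt_state @ W      # shape (512, 4096)

# Token-by-token generation: the hidden state is a single row, so the
# "matmul" degenerates into a matrix-vector product. A 16x16 tensor-core
# fragment would only have its top row populated.
token_state = np.random.randn(1, hidden).astype(np.float16)
token_out = token_state @ W        # shape (1, 4096)
```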

And INT4 is only supported by tensor cores (not to be confused with CUDA's int4, which is a vector type of four 32-bit ints you can use to help with memory coalescing; that's four full ints packed into one structure, not four 4-bit values). It also wouldn't really be the right pick here, because GPTQ values are quantized with scale and zero-point parameters, so doing any math on the bare 4-bit values would give incorrect results.
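To see why math on the bare 4-bit codes goes wrong, here's a minimal NumPy sketch of scale/zero-point dequantization. The parameter names and values are illustrative, not GPTQ's actual tensor layout:

```python
import numpy as np

# Hypothetical per-group quantization parameters.
scale = np.float32(0.05)
zero = 8    # zero-point within the 0..15 range of a 4-bit code

q = np.array([0, 7, 8, 15], dtype=np.uint8)   # bare 4-bit codes

# Correct: dequantize first, then do arithmetic in float.
w = scale * (q.astype(np.float32) - zero)

# Wrong: arithmetic directly on the codes ignores scale and zero-point,
# so the result is in meaningless "code units".
bad_sum = int(q.sum())       # 30, in code units
good_sum = float(w.sum())    # -0.1, in actual weight units
```

An INT4 tensor-core op would effectively compute in those raw code units, which is why the weights have to be dequantized before (or folded into) the matmul.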

Answer selected by TNT3530