
the paper mentioned that all linear ops are quantized into int4, what about gradients in mat-multiply ops in the attention module? Float or int4? #4

Open
brisker opened this issue Aug 11, 2023 · 1 comment

Comments


brisker commented Aug 11, 2023

Nice work in this paper. I'd like to know:
the paper mentions that all linear ops are quantized to INT4, but what about the mat-multiply ops in the attention module? Is the activation gradient in those matmul ops float or INT4?
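For reference, here is a minimal PyTorch sketch (my own shapes and names, not the paper's code) of the two matmuls in the attention module that the question is about, and the gradient matmuls that appear in their backward pass:

```python
import torch

B, H, T, d = 2, 4, 16, 8                      # batch, heads, tokens, head dim
Q = torch.randn(B, H, T, d, requires_grad=True)
K = torch.randn(B, H, T, d, requires_grad=True)
V = torch.randn(B, H, T, d, requires_grad=True)

S = Q @ K.transpose(-2, -1) / d ** 0.5        # matmul 1: attention scores
P = S.softmax(dim=-1)
O = P @ V                                     # matmul 2: attention output

# The backward pass of these two matmuls contains, e.g.,
#   dV = P^T @ dO,  dP = dO @ V^T,  dQ = dS @ K,  dK = dS^T @ Q
# The question is whether these gradient matmuls run in INT4 or in float.
O.sum().backward()
```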

brisker changed the title on Aug 11, 2023 (added "gradients in" before "mat-multiply ops").

brisker commented Aug 17, 2023

@xijiu9
Besides, in the grad_weight calculation, the code here does not seem to be a pure INT4 matmul, since sample_x3 is divided by norm_weight_loop after being quantized to INT4 here. The code is a little confusing to me: norm_weight_loop, which has shape N*1, is involved in the backprop. Is your INT4 matmul using per-channel (batch-dimension) quantization? Even if so, this cannot be done efficiently in hardware (or it loses the acceleration benefit of quantization), because the Cout*N (activation gradient) times N*Cin (input activation) matmul cannot be per-channel quantized along the N (batch) dimension, since N is the dimension being contracted. See the sketch below.
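To make the concern concrete, here is a minimal sketch (my own toy fake-quantization, not the repo's code; grad_out, x and scale_per_sample only mimic the roles of the activation gradient, sample_x3 and norm_weight_loop) of why a per-sample scale of shape N*1 on the contracted batch dimension breaks a pure INT4 matmul, while per-tensor scales do not:

```python
import torch

def fake_int4(t):
    # symmetric per-tensor fake quantization onto the INT4 grid [-8, 7]
    s = t.abs().max() / 7
    return torch.clamp(torch.round(t / s), -8, 7), s

N, Cout, Cin = 128, 64, 32
grad_out = torch.randn(N, Cout)   # activation gradient, N x Cout
x = torch.randn(N, Cin)           # input activation,    N x Cin

q_g, s_g = fake_int4(grad_out)
q_x, s_x = fake_int4(x)

# Per-tensor scales factor out of the contraction over N, so
#   grad_W = (s_g * s_x) * (q_g^T @ q_x)
# is a single INT4 x INT4 matmul followed by one float rescale.
grad_w = (s_g * s_x) * (q_g.t() @ q_x)

# A per-sample scale of shape N x 1 (like norm_weight_loop) lives on the
# contracted dimension N. Dividing the quantized operand by it before the
# matmul turns that operand back into floats, so the accumulation over N
# is no longer integer arithmetic and the INT4 speedup is lost.
scale_per_sample = x.abs().amax(dim=1, keepdim=True) / 7    # N x 1
grad_w_mixed = q_g.t() @ (q_x / scale_per_sample)
```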
