
the paper mentioned that all linear ops are quantized into int4, what about gradients in mat-multiply ops in the attention module? Float or int4? #4

Open
brisker opened this issue Aug 11, 2023 · 1 comment

Comments


brisker commented Aug 11, 2023

Nice work in this paper. I'd like to know:
the paper mentions that all linear ops are quantized to INT4, but what about the mat-multiply ops in the attention module? Is the activation gradient in those matmul ops float or INT4?
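For reference, here is a minimal PyTorch sketch (my own shapes and names, not the paper's code) of the two matmuls in the attention module that the question is about, and the gradient matmuls that appear in their backward pass:

```python
import torch

B, H, T, d = 2, 4, 16, 8                      # batch, heads, tokens, head dim
Q = torch.randn(B, H, T, d, requires_grad=True)
K = torch.randn(B, H, T, d, requires_grad=True)
V = torch.randn(B, H, T, d, requires_grad=True)

S = Q @ K.transpose(-2, -1) / d ** 0.5        # matmul 1: attention scores
P = S.softmax(dim=-1)
O = P @ V                                     # matmul 2: attention output

# The backward pass of these two matmuls contains, e.g.,
#   dV = P^T @ dO,  dP = dO @ V^T,  dQ = dS @ K,  dK = dS^T @ Q
# The question is whether these gradient matmuls run in INT4 or in float.
O.sum().backward()
```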

brisker changed the title on Aug 11, 2023 (added "gradients in" before "mat-multiply ops").

brisker commented Aug 17, 2023

@xijiu9
Besides, in the grad_weight calculation, the code here does not seem to be a pure INT4 matmul, since sample_x3 is divided by norm_weight_loop after being quantized to INT4 here. The code is a little confusing to me: norm_weight_loop, which has shape N*1, is involved in the backprop. Is your INT4 matmul using per-channel (batch-dimension) quantization? Even if so, this cannot be done efficiently in hardware (or it loses the acceleration benefit of quantization), because the Cout*N (activation gradient) times N*Cin (input activation) matmul cannot be per-channel quantized along the N (batch) dimension, since N is the dimension being contracted. See the sketch below.
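To make the concern concrete, here is a minimal sketch (my own toy fake-quantization, not the repo's code; grad_out, x and scale_per_sample only mimic the roles of the activation gradient, sample_x3 and norm_weight_loop) of why a per-sample scale of shape N*1 on the contracted batch dimension breaks a pure INT4 matmul, while per-tensor scales do not:

```python
import torch

def fake_int4(t):
    # symmetric per-tensor fake quantization onto the INT4 grid [-8, 7]
    s = t.abs().max() / 7
    return torch.clamp(torch.round(t / s), -8, 7), s

N, Cout, Cin = 128, 64, 32
grad_out = torch.randn(N, Cout)   # activation gradient, N x Cout
x = torch.randn(N, Cin)           # input activation,    N x Cin

q_g, s_g = fake_int4(grad_out)
q_x, s_x = fake_int4(x)

# Per-tensor scales factor out of the contraction over N, so
#   grad_W = (s_g * s_x) * (q_g^T @ q_x)
# is a single INT4 x INT4 matmul followed by one float rescale.
grad_w = (s_g * s_x) * (q_g.t() @ q_x)

# A per-sample scale of shape N x 1 (like norm_weight_loop) lives on the
# contracted dimension N. Dividing the quantized operand by it before the
# matmul turns that operand back into floats, so the accumulation over N
# is no longer integer arithmetic and the INT4 speedup is lost.
scale_per_sample = x.abs().amax(dim=1, keepdim=True) / 7    # N x 1
grad_w_mixed = q_g.t() @ (q_x / scale_per_sample)
```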
