[DRAFT][s4xbf16] JoinOp vectorization patch #22

ggengnv · 2025-02-26T00:15:40Z

Do not merge until rebased on upstream Triton up to triton-lang@c1ed673

This is 1 of the 2 patches needed to improve int4xbf16 GEMM perf.

This is needed because joinOp by default interleaves the two input matrices element by element. For bf16, two values are packed into each 32-bit register, so Triton has to extract both bf16 values from the register and re-insert them into a new one. This results in many mov instructions before the MMA. On certain shapes, this can mean a ~10% perf penalty.

This PR addresses the above by "vectorizing" the interleaving where applicable: it joins two elements at a time instead of one, which avoids the need to extract values out of registers. This also requires modifying the inline_asm logic before the join so that it produces the correct layout.
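To illustrate the difference in ordering, here is a minimal pure-Python sketch. It is not the patch's code: the function names (`join_elementwise`, `join_vectorized`, `pack_registers`) are hypothetical, and "registers" are modeled as tuples of two values, on the assumption that one 32-bit register holds two bf16 values.

```python
def join_elementwise(a, b):
    """Default-style interleave: a0, b0, a1, b1, ..."""
    out = []
    for x, y in zip(a, b):
        out.extend([x, y])
    return out


def join_vectorized(a, b, vec=2):
    """Interleave in groups of `vec`: a0, a1, b0, b1, a2, a3, ..."""
    out = []
    for i in range(0, len(a), vec):
        out.extend(a[i:i + vec])
        out.extend(b[i:i + vec])
    return out


def pack_registers(values, per_reg=2):
    """Model 32-bit registers as tuples holding two bf16 values each."""
    return [tuple(values[i:i + per_reg]) for i in range(0, len(values), per_reg)]


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]

# Registers as they exist before the join.
input_regs = set(pack_registers(a) + pack_registers(b))

elem_regs = pack_registers(join_elementwise(a, b))
vec_regs = pack_registers(join_vectorized(a, b))

# Count output registers that are identical to an input register,
# i.e. registers that need no unpack/repack.
reused_elem = sum(r in input_regs for r in elem_regs)
reused_vec = sum(r in input_regs for r in vec_regs)

print("element-wise:", elem_regs, "reused:", reused_elem)
print("vectorized:  ", vec_regs, "reused:", reused_vec)
```

In this model the element-wise join rebuilds every output register from two different input registers, while the pairwise join passes each input register through unchanged. The trade-off is that the joined tensor has a different element order, which is why the code feeding the join has to be adjusted to match.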

cc @gflegar

ggengnv · 2025-02-26T00:46:53Z

For small-M shapes, the best perf additionally requires XLA to swap A/B so that the LHS of the dot is the quantized operand, and setting the DISABLE_MMA_V3 envvar to force Ampere MMA.
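As a rough sketch of what this involves (not XLA's actual rewrite): swapping the operands of a dot relies on the identity C = A·B = (Bᵀ·Aᵀ)ᵀ, which puts the quantized operand on the LHS without changing the result. The helpers `matmul` and `transpose` below are hypothetical, the matrices are toy integers rather than int4/bf16 data, and the value `"1"` for `DISABLE_MMA_V3` is an assumption about how the variable is enabled.

```python
import os

# Assumption: "1" enables the flag; it must be set before the kernel is compiled.
os.environ["DISABLE_MMA_V3"] = "1"


def matmul(x, y):
    """Naive matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def transpose(x):
    """Transpose a list of rows."""
    return [list(col) for col in zip(*x)]


a = [[1, 2, 3]]                # M=1, K=3: small-M activations
w = [[1, 0], [0, 1], [2, 2]]   # K=3, N=2: stands in for the quantized weights

c = matmul(a, w)               # original order: quantized operand on the RHS

# Swapped order: the quantized operand (transposed) is now the LHS of the dot.
c_swapped = transpose(matmul(transpose(w), transpose(a)))

print(c, c_swapped)
```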

Add joinOp vectorization

0b43a35

ggengnv changed the title ~~[DRAFT] JoinOp vectorization patch~~ [DRAFT][s4xbf16] JoinOp vectorization patch Feb 26, 2025
