Kernel 13 implementation, even faster matmul with conflict-free registers -> smem storage without padding #15

Open
Aladoro wants to merge 2 commits into pranjalssh:main from Aladoro:padding-free-bank-conflict-free
Conversation

Aladoro commented Feb 15, 2026

Kernel 13 removes the padding introduced in kernel 12. Instead, it swizzles the C tile during the register-to-shared-memory transfer, so the stores stay free of bank conflicts without the extra shared memory that padding costs.
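To make the padding-versus-swizzling trade-off concrete, here is a minimal host-side sketch in C++. It is not the code from this PR: it assumes a hypothetical 32x32 tile of 4-byte words, the usual 32 four-byte shared-memory banks, and a plain `col ^ row` XOR swizzle, and it counts how many lanes of a warp hit the same bank when each lane writes one row of a fixed column. All names (`max_bank_conflict`, `naive_conflicts`, and so on) are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // assumed: 32 banks, each 4 bytes wide
constexpr int kTile = 32;   // hypothetical 32x32 tile of 4-byte words

// Worst-case number of lanes in one warp that land in the same bank.
// index_of(lane) returns the word index that lane writes to.
template <typename IndexFn>
int max_bank_conflict(IndexFn index_of) {
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < 32; ++lane) {
        ++hits[index_of(lane) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Lane `row` writes element (row, col) of the tile in each layout.

// Row-major, no padding: every row of a column maps to the same bank.
int naive_conflicts(int col) {
    return max_bank_conflict([col](int row) { return row * kTile + col; });
}

// One word of padding per row: conflict-free, but uses 33 words per row.
int padded_conflicts(int col) {
    return max_bank_conflict([col](int row) { return row * (kTile + 1) + col; });
}

// XOR swizzle of the column index: conflict-free with no extra memory.
int swizzled_conflicts(int col) {
    return max_bank_conflict([col](int row) { return row * kTile + (col ^ row); });
}

// Sanity checks for the three layouts.
void self_check() {
    assert(naive_conflicts(0) == 32);    // 32-way conflict
    assert(padded_conflicts(0) == 1);    // no conflict, 1/32 more memory
    assert(swizzled_conflicts(0) == 1);  // no conflict, no padding
}
```

In this model the padded and swizzled layouts both reach one access per bank, but only the swizzled one keeps the tile at its original size, which is the property kernel 13 relies on.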

In my tests with 8192-dimensional inputs, kernel 13 reaches 823.5 TFLOPS, up from 817.8 for kernel 12 and 809.1 for kernel 11.

Kudos to @gordicaleksa for posting one of the best explanations of swizzling out there and making me aware of your awesome repo ;)

On a side note, one of your comments stated:

"// We use 3d tiling to load from GMEM to SMEM. 2d tiling only works for tiles <= 64 columns."

This is a bit imprecise. I believe 2D tiling would also have worked in your previous implementation. The actual reason for 3D tiling is to support swizzling with the CU_TENSOR_MAP_SWIZZLE_128B layout: if the fastest dimensions of a tile are not [64, columns], the swizzling pattern is less effective at reducing bank conflicts when loading from and storing to shared memory.
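The point about the 64-wide fastest dimension can be illustrated with a small host-side model in C++. This is a sketch under stated assumptions, not the hardware specification or the kernel's code: it assumes 2-byte elements (so 64 elements span 128 bytes) and the commonly described form of the 128-byte swizzle, in which the 16-byte chunk index (address bits 4..6) is XORed with address bits 7..9. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Assumed model of the 128-byte swizzle: XOR the 16-byte chunk index
// (bits 4..6 of the byte offset) with bits 7..9 of the byte offset.
constexpr std::uint32_t swizzle_128b(std::uint32_t byte_offset) {
    return byte_offset ^ (((byte_offset >> 7) & 7u) << 4);
}

// Eight consecutive rows access the same logical 16-byte chunk column.
// Returns how many distinct 16-byte slots of the 128-byte bank window
// those accesses are spread over (8 means fully spread out).
int distinct_chunk_slots(std::uint32_t row_pitch_bytes, std::uint32_t chunk) {
    std::set<std::uint32_t> slots;
    for (std::uint32_t row = 0; row < 8; ++row) {
        std::uint32_t addr = swizzle_128b(row * row_pitch_bytes + chunk * 16);
        slots.insert((addr >> 4) & 7u);
    }
    return static_cast<int>(slots.size());
}

// Sanity checks for the two row pitches discussed above.
void self_check() {
    // 64 two-byte elements per row (128-byte pitch): all 8 slots are used.
    assert(distinct_chunk_slots(128, 0) == 8);
    // 128 two-byte elements per row (256-byte pitch): only 4 slots are used,
    // so pairs of rows collide.
    assert(distinct_chunk_slots(256, 0) == 4);
}
```

With a 128-byte row pitch the row index lands exactly in the bits that drive the XOR, so eight rows spread over eight slots. With a wider row, half of those bits are taken up by the column offset, and in this model the spread drops to four slots.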

Please do not hesitate to let me know if you have any questions, and thanks for sharing this repo!
