Kernel 13 implementation, even faster matmul with conflict-free registers -> smem storage without padding #15
Open
Aladoro wants to merge 2 commits into pranjalssh:main from
Kernel 13 removes the padding introduced in kernel 12 and instead swizzles the C tile, so that the register-to-shared-memory transfer stays free of bank conflicts.
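For readers less familiar with the trade-off, here is a minimal host-side sketch of why an XOR swizzle can replace padding. It is not the kernel's actual layout: the 32x32 tile of 4-byte words, the one-row-per-thread store pattern, and all names (`addr_naive`, `addr_padded`, `addr_swizzled`, `conflict_degree`) are assumptions chosen for illustration. It only models the usual rule of 32 banks, each 4 bytes wide.

```cpp
// Hypothetical model: a warp of 32 threads, thread t stores one 4-byte word
// to row t, logical column `col` of a shared-memory tile.
constexpr int kBanks = 32;     // 4-byte-wide shared memory banks
constexpr int kWarpSize = 32;
constexpr int kTileCols = 32;  // assumed tile width in 4-byte words

// Word index of (row, col) under three candidate layouts.
constexpr int addr_naive(int row, int col)    { return row * kTileCols + col; }
constexpr int addr_padded(int row, int col)   { return row * (kTileCols + 1) + col; }
constexpr int addr_swizzled(int row, int col) { return row * kTileCols + (col ^ row); }

// Largest number of threads whose (distinct) words fall into the same bank.
// 1 means conflict-free, 32 means the store is fully serialized.
constexpr int conflict_degree(int (*addr)(int, int), int col) {
    int hits[kBanks] = {};
    for (int t = 0; t < kWarpSize; ++t) hits[addr(t, col) % kBanks]++;
    int worst = 0;
    for (int b = 0; b < kBanks; ++b) {
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// Shared memory footprint in words for a 32-row tile.
constexpr int kPaddedWords   = kWarpSize * (kTileCols + 1);  // one extra word per row
constexpr int kSwizzledWords = kWarpSize * kTileCols;        // no extra storage

static_assert(conflict_degree(addr_naive, 5) == 32, "every row maps to bank 5");
static_assert(conflict_degree(addr_padded, 5) == 1, "padding shifts each row by one bank");
static_assert(conflict_degree(addr_swizzled, 5) == 1, "XOR permutes columns per row");
static_assert(kPaddedWords - kSwizzledWords == 32, "padding costs 32 extra words here");
```

In this toy model both padding and swizzling make the store conflict-free, but only the swizzled layout keeps the tile at its unpadded size.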
In my tests with 8192-dimensional inputs, this change brings kernel 13 to 823.5 TFLOPs, up from 817.8 for kernel 12 and 809.1 for kernel 11.
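To put these numbers in perspective, the arithmetic below is a rough sketch under two assumptions that are not stated above: the matrices are square with M = N = K = 8192, and the reported figures are in TFLOP/s. The helper names are hypothetical.

```cpp
// Assumed problem size: C = A * B with M = N = K = 8192.
constexpr double kN = 8192.0;
// One multiply and one add per inner-product term gives 2 * N^3 operations.
constexpr double kFlopsPerMatmul = 2.0 * kN * kN * kN;

// Reported throughputs, interpreted here as FLOP/s (assumption: TFLOP/s units).
constexpr double kKernel11 = 809.1e12;
constexpr double kKernel12 = 817.8e12;
constexpr double kKernel13 = 823.5e12;

// Fractional improvement of new_rate over old_rate.
constexpr double relative_gain(double new_rate, double old_rate) {
    return new_rate / old_rate - 1.0;
}

// Time for one matmul in milliseconds at the given rate.
constexpr double runtime_ms(double rate) {
    return kFlopsPerMatmul / rate * 1e3;
}

static_assert(kFlopsPerMatmul == 1099511627776.0, "2 * 8192^3 = 2^40");
static_assert(relative_gain(kKernel13, kKernel12) > 0.0069 &&
              relative_gain(kKernel13, kKernel12) < 0.0070, "about 0.7% over kernel 12");
static_assert(relative_gain(kKernel13, kKernel11) > 0.0177 &&
              relative_gain(kKernel13, kKernel11) < 0.0179, "about 1.8% over kernel 11");
static_assert(runtime_ms(kKernel13) > 1.33 && runtime_ms(kKernel13) < 1.34,
              "roughly 1.34 ms per matmul under these assumptions");
```

Under these assumptions the gain is about 0.7% over kernel 12 and about 1.8% over kernel 11.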
Kudos to @gordicaleksa for posting one of the best explanations of swizzling out there and making me aware of your awesome repo ;)
On a side note, one of your comments stated:
"// We use 3d tiling to load from GMEM to SMEM. 2d tiling only works for tiles <= 64 columns."
This is a bit imprecise. I believe 2D tiling would have worked in your previous implementation as well. The actual reason for using 3D tiling is to support swizzling with the CU_TENSOR_MAP_SWIZZLE_128B layout. If the fastest dimensions of a tile are not [64, columns], the swizzling pattern is less effective at reducing bank conflicts when loading from and storing to shared memory.
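To make the link between the number 64 and the 128-byte swizzle concrete, here is a small host-side sketch. It assumes 2-byte (bf16) elements, a shared memory tile whose base is aligned to 1024 bytes, and the commonly described form of the 128-byte pattern, in which bits 4 to 6 of the byte offset are XORed with bits 7 to 9. The function names are hypothetical, and this models only the address mapping, not the TMA call itself.

```cpp
#include <cstdint>

constexpr unsigned kSwizzleBytes = 128;                  // span of one swizzled row
constexpr unsigned kChunkBytes = 16;                     // swizzle granularity
constexpr unsigned kElemBytes = sizeof(std::uint16_t);   // stand-in for bf16 (assumption)
constexpr unsigned kColsPerSwizzleRow = kSwizzleBytes / kElemBytes;

// Assumed 128B swizzle: XOR the 16-byte chunk index (bits 4..6)
// with the row index modulo 8 (bits 7..9).
constexpr unsigned swizzle_128b(unsigned byte_offset) {
    return byte_offset ^ (((byte_offset >> 7) & 0x7u) << 4);
}

// Physical 16-byte slot (0..7) within a 128-byte line for a logical (row, chunk).
constexpr unsigned chunk_slot(unsigned row, unsigned chunk, bool swizzled) {
    unsigned off = row * kSwizzleBytes + chunk * kChunkBytes;
    if (swizzled) off = swizzle_128b(off);
    return (off % kSwizzleBytes) / kChunkBytes;
}

// Number of distinct slots touched when 8 consecutive rows access the same chunk.
// Each slot covers 4 banks, so 8 distinct slots cover all 32 banks.
constexpr unsigned distinct_slots(unsigned chunk, bool swizzled) {
    unsigned seen = 0;
    for (unsigned row = 0; row < 8; ++row) {
        seen |= 1u << chunk_slot(row, chunk, swizzled);
    }
    unsigned count = 0;
    for (unsigned b = 0; b < 8; ++b) count += (seen >> b) & 1u;
    return count;
}

static_assert(kColsPerSwizzleRow == 64, "128 bytes hold 64 two-byte elements");
static_assert(distinct_slots(2, false) == 1, "unswizzled: all 8 rows hit the same 4 banks");
static_assert(distinct_slots(2, true) == 8, "swizzled: the 8 rows spread over all 32 banks");
```

Under these assumptions one swizzled row holds exactly 64 elements, so a wider tile has to be expressed as several 64-column slabs, which is where the third tiling dimension comes from.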
Please do not hesitate to let me know if you have any questions, and thanks for sharing this repo!