Column Major Ordering #16
Comments
Great catch! I've tested both formulations (calculating with grid_m and with grid_n), and both appear to produce the same column-major movement. See the PyTorch code snippet in the blog post, which shows the output this formulation produces.
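For reference, here is a minimal standalone sketch (plain Python, not the actual Triton kernel) that enumerates the block indices produced by the two formulations for a chosen launch grid, so the traversal orders can be compared directly. Only the pid_m line is quoted in this thread, so the `pid // grid_m` expression for pid_n is an assumption here:

```python
# Standalone sketch: compare the two pid -> (pid_m, pid_n) mappings discussed
# in this thread. This is only the index arithmetic, not the kernel itself.

def mapping_current(pid, grid_m, grid_n):
    # formulation being questioned: the modulo uses grid_n
    # (pid_n as pid // grid_m is an assumption, not quoted in the thread)
    return pid % grid_n, pid // grid_m

def mapping_proposed(pid, grid_m, grid_n):
    # proposed column-major formulation: the modulo uses grid_m
    return pid % grid_m, pid // grid_m

grid_m, grid_n = 4, 4  # arbitrary; change to a non-square grid to compare shapes
for pid in range(grid_m * grid_n):
    a = mapping_current(pid, grid_m, grid_n)
    b = mapping_proposed(pid, grid_m, grid_n)
    print(f"pid={pid:2d}  current={a}  proposed={b}" + ("" if a == b else "  <-- differ"))
```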
I think there's something wrong with the current code. When I simply swap the calculation of …
Hey, thanks for figuring this out! I'm still not clear on why this is able to pass the test cases we throw at it, so I'll need to dig a bit deeper on that end!
From my test results, the address of intermediate_cache is always the same between adjacent v0 and v2 runs. The …
If certain block calculations are skipped, of course that gives a speedup, doesn't it? I also came across this problem while reading the code today. I'm not sure why we apply a particular ordering when launching the thread blocks; it seems to run contrary to typical CUDA programming.
The schedule determines the cache re-use pattern of your algorithm. The ordering is not unique and should be optimized for the kind of problem sizes you are working with. A similar technique is used in CUDA: https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/
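For illustration, here is a sketch of one such schedule: the grouped ("swizzled") launch ordering used in the Triton matmul tutorial, which is the same idea as the thread-group-ID swizzling described in the NVIDIA post above. The names (group_size, grid_m, grid_n) are illustrative placeholders, not this repository's code:

```python
# Sketch of a grouped (swizzled) schedule: consecutive program ids are packed
# into group_size-tall column groups, so blocks launched close together in
# time tend to reuse the same input tiles from L2.

def grouped_order(pid, grid_m, grid_n, group_size):
    num_pid_in_group = group_size * grid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size
    group_size_m = min(grid_m - first_pid_m, group_size)  # last group may be short
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n

grid_m, grid_n = 6, 4
for pid in range(grid_m * grid_n):
    print(pid, grouped_order(pid, grid_m, grid_n, group_size=2))
```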
@AdnanHoque @lessw2020
Thanks for the great blog post and kernels.
In the column-major ordering, why is `pid_m = (pid % grid_n)`? `grid_m` is the leading dimension (the number of block rows), so should it be `pid_m = pid % grid_m`? Apologies if I'm misunderstanding the issue.
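For context, here is a small standalone sketch of the textbook column-major decomposition the question is appealing to (just the index arithmetic, not this repository's kernel): with `grid_m` as the leading dimension, consecutive pids walk down one block column before moving to the next.

```python
# Textbook column-major decomposition of a linear program id:
#   pid = pid_n * grid_m + pid_m  =>  pid_m = pid % grid_m, pid_n = pid // grid_m
# Consecutive pids then advance along the M dimension within one block column.

grid_m, grid_n = 3, 2  # deliberately non-square so grid_m and grid_n differ
for pid in range(grid_m * grid_n):
    pid_m = pid % grid_m
    pid_n = pid // grid_m
    print(f"pid={pid}: (pid_m={pid_m}, pid_n={pid_n})")
# prints (0,0) (1,0) (2,0) (0,1) (1,1) (2,1)
```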