Column Major Ordering #16

Open · jeromeku opened this issue Apr 5, 2024 · 6 comments

jeromeku commented Apr 5, 2024

@AdnanHoque @lessw2020

Thanks for the great blog post and kernels.

In the column-major ordering:

    grid_m = tl.cdiv(m, block_m)
    grid_n = tl.cdiv(n, block_n)

    pid_m = (pid % grid_n)
    pid_n = pid // grid_m

Why is pid_m = (pid % grid_n)?

grid_m is the leading dimension (the number of block rows), so shouldn't it be pid_m = pid % grid_m? Apologies if I'm misunderstanding the issue.
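
To make the concern concrete, here is a minimal Python sketch of the index arithmetic alone (the grid sizes are assumed, deliberately chosen to be non-square):

    # Sketch of the launch-order arithmetic only; grid_m = number of block rows,
    # grid_n = number of block columns, with assumed (hypothetical) non-square sizes.
    grid_m, grid_n = 4, 2

    for pid in range(grid_m * grid_n):
        as_written = (pid % grid_n, pid // grid_m)    # formulation quoted above
        column_major = (pid % grid_m, pid // grid_m)  # mapping I would expect
        print(pid, as_written, column_major)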

AdnanHoque (Contributor) commented Apr 7, 2024

Great catch!

I've tested both formulations (computing with grid_m and with grid_n), and both appear to produce the same column-major movement. The PyTorch code snippet in the blog post shows that this formulation does indeed produce pid_m, pid_n = (0, 0), (1, 0), (2, 0), etc.
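
For reference, a quick sketch comparing the two formulations (the grid sizes below are assumed, not taken from the blog post); they are only guaranteed to coincide when the launch grid happens to be square, since then pid % grid_n == pid % grid_m:

    # Sketch: the two formulations agree for a square grid but can diverge otherwise.
    def as_written(pid, grid_m, grid_n):
        return pid % grid_n, pid // grid_m

    def with_grid_m(pid, grid_m, grid_n):
        return pid % grid_m, pid // grid_m

    square = all(as_written(p, 3, 3) == with_grid_m(p, 3, 3) for p in range(9))
    rectangular = all(as_written(p, 4, 2) == with_grid_m(p, 4, 2) for p in range(8))
    print(square, rectangular)  # True False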


chu-tianxiang commented Apr 8, 2024

I think there's something wrong with the current code. When I simply swap the order in which fused_moe_v0 and fused_moe_v2 are computed (I replaced v1 with v2) in test_moe_gemm.py, the allclose assertions fail.
Changing pid_m = (pid % grid_n) to pid_m = (pid % grid_m) makes the test pass, but then the performance is almost the same as the current vLLM implementation.
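
To see why the test fails, here is a quick sketch (grid sizes are assumed, chosen to be non-square): the original mapping only ever visits a subset of the blocks, so the skipped output blocks are never written.

    # Sketch: count how many distinct (pid_m, pid_n) blocks each mapping visits
    # across all program ids, for an assumed non-square grid.
    grid_m, grid_n = 4, 2
    original = {(pid % grid_n, pid // grid_m) for pid in range(grid_m * grid_n)}
    fixed = {(pid % grid_m, pid // grid_m) for pid in range(grid_m * grid_n)}
    print(len(original), len(fixed))  # 4 8 -> the original mapping never writes half the blocks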

@AdnanHoque (Contributor)

Hey, thanks for figuring this out!
Yes indeed. Unfortunately, with this fix it appears that the speedup from the column-major optimization is gone.

I'm still not clear on why the original code is able to pass the test cases we throw at it -- so will need to dig a bit deeper on that end!

@chu-tianxiang

From my test results, the address of intermediate_cache is always the same between the adjacent v0 and v2 runs. The grid_m problem causes some block calculations to be skipped, but those output blocks still hold the right values left over from the last v0 run.
I don't know how the torch caching allocator works or how to avoid the problem. I tried empty_cache between the runs, but no luck.
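
A minimal sketch of the suspected behavior (an assumption about the caching allocator's reuse policy, not a guarantee):

    # Sketch: the CUDA caching allocator often hands the just-freed block back for a
    # same-sized allocation, so an "uninitialized" torch.empty can still contain the
    # previous run's results. This reuse is assumed behavior, not a guarantee.
    import torch

    a = torch.full((1024, 1024), 3.14, device="cuda")
    ptr_a = a.data_ptr()
    del a                                     # storage returns to the caching allocator
    b = torch.empty((1024, 1024), device="cuda")
    print(b.data_ptr() == ptr_a)              # often True
    print(b[0, 0].item())                     # often 3.14: stale data masking skipped writes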


hgl71964 commented Apr 18, 2024

If certain block calculations are skipped, of course it gives a speedup, doesn't it?

I also came across this problem when reading the code today. I am not sure why we apply a particular ordering when launching the thread blocks; it seems to contradict CUDA programming, where the hardware decides the block scheduling order.

AdnanHoque (Contributor) commented Apr 18, 2024

The schedule determines the cache re-use pattern of your algorithm. The ordering is not unique and should be optimized for the problem sizes you are working with. Note that the remapping doesn't dictate the hardware's launch order; it only changes which output tile each program id computes, and since nearby program ids tend to be resident at the same time, that assignment is what shapes L2 re-use. A similar thing is done in CUDA.

CUDA Link: https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/
Triton Link: https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations
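
As a pure-Python sketch of the grouped ("swizzled") ordering described in the Triton tutorial linked above (the grid shape and GROUP_SIZE_M below are assumed values, just to print the mapping):

    # Sketch of the grouped launch ordering from the Triton matmul tutorial.
    def grouped_order(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
        num_pid_in_group = GROUP_SIZE_M * num_pid_n
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
        pid_n = (pid % num_pid_in_group) // group_size_m
        return pid_m, pid_n

    # Program ids sweep down a GROUP_SIZE_M-tall band of output tiles before moving
    # right, so the A rows / B columns for that band stay hot in L2.
    print([grouped_order(pid, num_pid_m=4, num_pid_n=4, GROUP_SIZE_M=2) for pid in range(16)])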
