FlashMLA from DeepSeek #892
Comments
I came here for this! @zhyncs was really fast.
How about #887? Can we compare it against https://github.com/deepseek-ai/FlashMLA?
The pipeline design is a little different from #887; I'll check what we can learn from it.
@zhyncs @celsowm @MichoChan here are the results I got on H100, running the latest flashinfer and FlashMLA mainline (higher is better). For flashinfer we use page_size=1, while FlashMLA uses page_size=64.
Here are my benchmark code and results on H100: https://gist.github.com/abcdabcd987/b215c5f00f4b5e8399b95d7933bcf475 https://docs.google.com/spreadsheets/d/1t0Txa7Ph9u7Su9LyWpS24vqr9A5FB-FyL0EZNpYOqwg/edit?gid=0#gid=0 Both use page size 64. FlashMLA is faster in general, and much faster at small batch sizes.
As pointed out in #892 (comment), the second stage of split-k seems to have a huge overhead. This PR is the first step in addressing these issues, changing the vector size from 4 to 8.
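For readers unfamiliar with what "changing the vector size" means here, the sketch below illustrates the general idea of widening per-thread vector accesses in a split-k combine kernel from 4 to 8 floats, so each thread issues fewer, larger memory transactions. It is a simplified, hypothetical illustration: it does a plain summation rather than the log-sum-exp-weighted merge the real kernel performs, and names such as `combine_partials` are made up, not flashinfer's actual code.

```cuda
// Hypothetical sketch (not flashinfer's kernel): each thread owns VEC_SIZE
// contiguous floats of the output and accumulates them across all split-k
// partial results using 16-byte vectorized loads. Assumes dim is a multiple
// of VEC_SIZE so the float4 casts stay aligned.
#include <cuda_runtime.h>

template <int VEC_SIZE>
__global__ void combine_partials(const float* __restrict__ partial,  // [num_splits, dim]
                                 float* __restrict__ out,            // [dim]
                                 int num_splits, int dim) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * VEC_SIZE;
  if (base + VEC_SIZE > dim) return;

  float acc[VEC_SIZE];
#pragma unroll
  for (int v = 0; v < VEC_SIZE; ++v) acc[v] = 0.f;

  for (int s = 0; s < num_splits; ++s) {
    const float4* src = reinterpret_cast<const float4*>(partial + s * dim + base);
#pragma unroll
    for (int v = 0; v < VEC_SIZE / 4; ++v) {
      float4 x = src[v];  // one 16-byte load covers 4 of the thread's elements
      acc[4 * v + 0] += x.x;
      acc[4 * v + 1] += x.y;
      acc[4 * v + 2] += x.z;
      acc[4 * v + 3] += x.w;
    }
  }

  float4* dst = reinterpret_cast<float4*>(out + base);
#pragma unroll
  for (int v = 0; v < VEC_SIZE / 4; ++v)
    dst[v] = make_float4(acc[4 * v + 0], acc[4 * v + 1], acc[4 * v + 2], acc[4 * v + 3]);
}

// Example launch with vector size 8:
//   int threads = 128;
//   int blocks = (dim / 8 + threads - 1) / threads;
//   combine_partials<8><<<blocks, threads>>>(partial, out, num_splits, dim);
```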
Hi @abcdabcd987, yes, I didn't profile the low-batch-size use cases, and I just realized we get low performance for small batch and long context. #894 alleviates the issue a little bit. Regarding the cases where qo_len * num_heads >= 128, the current flashinfer implementation is not good at this, because we prioritize …
I found DeepSeek FlashMLA is much faster than flashinfer when q_head_num equals 128 (tp1): almost 100% faster at bs=32. But when q_head_num is 16, 32, or 64, it is only 10%-20% faster.
We will try out the FlashMLA-style warp specialization in the next release. Created an issue for performance tracking: #897 |
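For context, warp specialization splits the warps of a thread block into producer warps that stage data into shared memory and consumer warps that compute on it, so memory traffic and math overlap. The toy kernel below is only a minimal sketch of that pattern under simplified assumptions: a plain tile sum instead of attention, and a `__syncthreads()` ping-pong instead of the cp.async/TMA and mbarrier machinery kernels like FlashMLA actually use. It is not FlashMLA's or flashinfer's code.

```cuda
// Minimal (assumed) warp-specialization pattern: warp 0 prefetches tile t
// into one shared-memory buffer while the remaining warps compute on the
// tile staged during the previous iteration, double-buffered.
#include <cuda_runtime.h>

constexpr int TILE = 128;      // elements per tile (illustrative)
constexpr int WARP_SIZE = 32;

__global__ void warp_specialized_sum(const float* __restrict__ data,
                                     float* __restrict__ out,
                                     int num_tiles) {
  __shared__ float buf[2][TILE];
  __shared__ float partial[1024];
  int warp = threadIdx.x / WARP_SIZE;
  int lane = threadIdx.x % WARP_SIZE;

  float acc = 0.f;
  for (int t = 0; t < num_tiles + 1; ++t) {
    int stage = t & 1;
    // Producer warp: stage tile t into buf[stage].
    if (warp == 0 && t < num_tiles) {
      for (int i = lane; i < TILE; i += WARP_SIZE)
        buf[stage][i] = data[t * TILE + i];
    }
    // Consumer warps: compute on the tile staged in the previous iteration.
    if (warp > 0 && t > 0) {
      int prev = (t - 1) & 1;
      for (int i = threadIdx.x - WARP_SIZE; i < TILE; i += blockDim.x - WARP_SIZE)
        acc += buf[prev][i];
    }
    __syncthreads();  // hand the freshly staged tile over to the consumers
  }

  // Reduce the consumers' partial sums (kept deliberately simple).
  partial[threadIdx.x] = (warp > 0) ? acc : 0.f;
  __syncthreads();
  if (threadIdx.x == 0) {
    float s = 0.f;
    for (int i = 0; i < blockDim.x; ++i) s += partial[i];
    *out = s;
  }
}

// Example launch (block size must be a multiple of 32 and > 32):
//   warp_specialized_sum<<<1, 128>>>(d_data, d_out, n / TILE);
```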
As observed in #892, we found that flashinfer MLA's second stage of split-k is very slow when batch size is small, because our scheduler only uses one CTA for the second stage of split-k. This PR fixes the issue.
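To make the fix concrete: in split-k attention each split produces a partial output along with its log-sum-exp (LSE), and the second stage re-weights and sums these partials so the result matches unsplit attention. The hypothetical kernel below sketches that combine step with one CTA per (request, head) row, so the second stage scales with batch size instead of serializing on a single CTA. It is an illustrative sketch, not the actual flashinfer scheduler or kernel; all names are made up.

```cuda
// Hypothetical split-k second stage: re-weight partial outputs by
// exp(lse_s - lse_total) and sum them. One CTA handles one (request, head)
// row; one thread handles one element of head_dim.
#include <cuda_runtime.h>
#include <math.h>

__global__ void merge_splitk(const float* __restrict__ part_out,  // [num_splits, num_rows, head_dim]
                             const float* __restrict__ part_lse,  // [num_splits, num_rows]
                             float* __restrict__ out,             // [num_rows, head_dim]
                             int num_splits, int num_rows, int head_dim) {
  int row = blockIdx.x;   // one CTA per (request, head) row
  int d = threadIdx.x;    // one thread per head_dim element
  if (row >= num_rows || d >= head_dim) return;

  // 1) Global log-sum-exp over all splits for this row (numerically stable).
  float m = -INFINITY;
  for (int s = 0; s < num_splits; ++s)
    m = fmaxf(m, part_lse[s * num_rows + row]);
  float denom = 0.f;
  for (int s = 0; s < num_splits; ++s)
    denom += expf(part_lse[s * num_rows + row] - m);

  // 2) Weighted sum of the partial outputs.
  float acc = 0.f;
  for (int s = 0; s < num_splits; ++s) {
    float w = expf(part_lse[s * num_rows + row] - m) / denom;
    acc += w * part_out[(s * num_rows + row) * head_dim + d];
  }
  out[row * head_dim + d] = acc;
}

// Example launch: one CTA per row, head_dim threads per CTA (head_dim <= 1024):
//   merge_splitk<<<num_rows, head_dim>>>(part_out, part_lse, out,
//                                        num_splits, num_rows, head_dim);
```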
Hello, I noticed the significant speed improvement in the latest test results, but the test script throws errors when running with the new version of FlashInfer. Does the test script need to be modified?
@yanghailong-git can you report the error message?
When running this script https://gist.github.com/abcdabcd987/b215c5f00f4b5e8399b95d7933bcf475 with version v0.2.2.post1, I encountered the error below. How should I resolve this? Thanks.
Can you post the full error message as text instead? Some key information was clipped in your screenshot.
The detailed error is as follows:
@yanghailong-git #904 should fix it. |
as titled
ref https://github.com/deepseek-ai/FlashMLA