[Tracking Issue] MLA performance tracking #897

Open
5 of 10 tasks
yzh119 opened this issue Feb 24, 2025 · 0 comments
yzh119 commented Feb 24, 2025

This issue is a follow-up to #887. Per #892 (comment), we found that flashinfer's MLA implementation is slower than FlashMLA in many cases. We created this issue to track the remaining items for improving flashinfer's MLA performance (mainly on Hopper):

Performance Tracking Table

Contributed by @abcdabcd987 :
https://docs.google.com/spreadsheets/d/1t0Txa7Ph9u7Su9LyWpS24vqr9A5FB-FyL0EZNpYOqwg/edit?gid=0#gid=0

Checklist

  • Slower at low batch sizes (mainly because of split-k).
  • Slower when qo_len * head_dim > 64 (we split qo_len * head_dim into tiles of size 64 and dispatch different query tiles to different CTAs; we need to improve the KV-cache access pattern when 2 CTAs share a cluster).
    • Use cluster sync to increase the L2 hit rate.
    • Use TMA and multi-casting for page_size >= 16.
  • Try different pipeline designs.
    • Try FlashMLA-style warp specialization: FlashMLA and perf: FlashAttention-3 style MLA PageAttention #887 use different pipeline and warp-specialization designs. More specifically:
      • Both FlashMLA and FlashInfer split PV on the head dimension, but FlashMLA does not split QK, while FlashInfer splits QK on the KV dimension.
      • FlashMLA uses two warpgroups: one for QK and PV1, and another for data loading and PV2.
      • FlashInfer uses three warpgroups: one for data loading, one for QK1 and PV1, and one for QK2 and PV2.
      • We should try FlashMLA-style warp specialization and check which design is better.
    • Another possible warp-specialization design is to introduce a separate warpgroup for QK: one for data loading, one for QK, one for PV1, and one for PV2.
  • Misc items
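To make the first two checklist items concrete, here is a minimal Python sketch of the scheduling logic described above: the query dimension (qo_len * head_dim, or equivalently the folded query rows) is split into tiles of 64 that are dispatched to different CTAs, and split-k additionally partitions the KV sequence so that small batches still produce enough CTAs to fill the GPU. All names and the chunking heuristic here are illustrative assumptions, not flashinfer's actual scheduler.

```python
# Hypothetical sketch of query-tile + split-k work scheduling.
# CTA_TILE_Q mirrors the tile size of 64 mentioned in the checklist;
# everything else (function name, chunking heuristic) is illustrative.

CTA_TILE_Q = 64


def schedule_work(qo_len: int, num_heads: int, kv_len: int,
                  num_sms: int, kv_chunk: int):
    """Return a list of (q_tile_idx, (kv_begin, kv_end)) work items,
    one per CTA."""
    # Fold heads into the query dimension, then tile by CTA_TILE_Q.
    num_q_rows = qo_len * num_heads
    num_q_tiles = (num_q_rows + CTA_TILE_Q - 1) // CTA_TILE_Q

    # Split-k: when there are too few query tiles to occupy all SMs
    # (the low-batch-size case above), partition the KV sequence into
    # chunks whose partial results are merged in a second pass.
    max_chunks_for_occupancy = max(1, num_sms // max(1, num_q_tiles))
    num_kv_chunks = max(1, min((kv_len + kv_chunk - 1) // kv_chunk,
                               max_chunks_for_occupancy))
    chunk_len = (kv_len + num_kv_chunks - 1) // num_kv_chunks

    work = []
    for q in range(num_q_tiles):
        for c in range(num_kv_chunks):
            kv_begin = c * chunk_len
            kv_end = min(kv_len, kv_begin + chunk_len)
            work.append((q, (kv_begin, kv_end)))
    return work
```

For example, a decode step with qo_len = 1 and 128 query heads yields only two query tiles, so split-k expands the work list across KV chunks; a large-batch prefill already has enough query tiles and collapses to a single KV chunk per tile. The cluster-sync and TMA multicast items above target the case where two CTAs in one cluster read the same KV chunk and should share, rather than duplicate, the L2/SMEM traffic.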
@yzh119 yzh119 changed the title MLA performance tracking [Tracking Issue] MLA performance tracking Feb 24, 2025