This issue is a followup of #887. Per #892 (comment), we found that flashinfer's MLA implementation is slower than FlashMLA in many cases, so we created this issue to track the remaining items for improving flashinfer's MLA performance (mainly on Hopper).
Performance Tracking Table
Contributed by @abcdabcd987:
https://docs.google.com/spreadsheets/d/1t0Txa7Ph9u7Su9LyWpS24vqr9A5FB-FyL0EZNpYOqwg/edit?gid=0#gid=0
Checklist
- [ ] Slower for `qo_len * head_dim > 64` (we split `qo_len * head_dim` by a tile size of 64, and different query tiles are dispatched to different CTAs; we need to improve the KV-Cache access pattern for the 2 CTAs within the cluster, as illustrated in the sketch after this list).
- [ ] `page_size >= 16`
- [ ] Remove `p_smem` and change unroll number (perf: tweak the pipeline design of mla kernel #901)
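To make the first checklist item concrete, below is a minimal standalone CUDA sketch of the tile-split dispatch. It is not flashinfer's actual kernel: the kernel name, parameter names, and shapes are all hypothetical, and the attention math is reduced to a placeholder accumulation. The point it illustrates is that once the flattened query dimension exceeds one 64-row tile, each additional CTA re-reads the entire KV cache from HBM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical tile size matching the dispatch described above:
// the flattened query dimension is cut into tiles of 64 rows,
// and each tile is assigned to its own CTA.
constexpr int CTA_TILE_Q = 64;

__global__ void mla_tile_sketch(const float* __restrict__ q,
                                const float* __restrict__ ckv,
                                float* __restrict__ out,
                                int num_q_rows, int kv_len, int head_dim) {
  // One query row per thread; one 64-row query tile per CTA.
  int row = blockIdx.x * CTA_TILE_Q + threadIdx.x;
  if (row >= num_q_rows) return;

  // Each CTA independently streams the *whole* KV cache for its tile.
  // With num_q_rows > 64 there are >= 2 CTAs, so the same KV data is
  // fetched from HBM multiple times. On Hopper, the 2 CTAs of a
  // thread-block cluster could instead each fetch half of the KV tiles
  // and exchange them (e.g. via distributed shared memory), which is
  // the access-pattern improvement the checklist item asks for.
  float acc = 0.f;
  for (int kv = 0; kv < kv_len; ++kv) {
    float s = 0.f;
    for (int d = 0; d < head_dim; ++d) {
      s += q[row * head_dim + d] * ckv[kv * head_dim + d];
    }
    acc += s;  // placeholder for softmax + weighted-V accumulation
  }
  out[row] = acc;
}

int main() {
  // Hypothetical shapes: 128 query rows -> 2 CTAs of 64 rows each.
  int num_q_rows = 128, kv_len = 1024, head_dim = 512;
  float *q, *ckv, *out;
  cudaMalloc(&q, sizeof(float) * num_q_rows * head_dim);
  cudaMalloc(&ckv, sizeof(float) * kv_len * head_dim);
  cudaMalloc(&out, sizeof(float) * num_q_rows);
  cudaMemset(q, 0, sizeof(float) * num_q_rows * head_dim);
  cudaMemset(ckv, 0, sizeof(float) * kv_len * head_dim);

  int num_ctas = (num_q_rows + CTA_TILE_Q - 1) / CTA_TILE_Q;
  mla_tile_sketch<<<num_ctas, CTA_TILE_Q>>>(q, ckv, out,
                                            num_q_rows, kv_len, head_dim);
  cudaDeviceSynchronize();
  printf("launched %d CTAs of %d query rows each\n", num_ctas, CTA_TILE_Q);
  cudaFree(q); cudaFree(ckv); cudaFree(out);
  return 0;
}
```

In this naive scheme, KV traffic scales with the number of query tiles; whether the improved version shares KV loads through distributed shared memory or another cluster mechanism is an open implementation choice for the item above.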