[Opt] Add fused triton operator for MLA models to gsa_on_device #894

Open
Fengli5355 wants to merge 1 commit into ModelEngine-Group:develop from Fengli5355:br_fused_op

Conversation

@Fengli5355
Contributor

Purpose

Fuse the hash and cache operators in the MLA sparse module.

Modifications

  1. Implemented a fused hash and concat-cache MLA kernel. Fusing the two operators avoids storing intermediate results to HBM, reducing GPU memory I/O and completing the work in a single kernel launch.
  2. Added a unit test for the MLA hash-and-cache fused operator.
  3. Renamed the unit test for the GQA hash-and-cache fused operator.
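The kernel source is not shown in this description, so here is a rough NumPy sketch of the fused semantics only. The function name, the sign-bit (SimHash-style) hash, and the cache layout are all illustrative assumptions, not the actual Triton implementation; the point is that the concatenated key never round-trips through a separate buffer between the hash step and the cache write.

```python
import numpy as np

def fused_hash_and_cache(k_nope, k_rope, hash_cache, kv_cache, slot_ids,
                         n_hash_bits=64):
    """Illustrative reference for a fused hash + concat-cache step (assumed
    semantics, not the PR's kernel). For each token: concatenate the two key
    parts, compute a sign-bit hash, and write both results to their caches
    in one pass, with no intermediate HBM buffer for the concatenated key."""
    d = k_nope.shape[1] + k_rope.shape[1]
    # Fixed random projection for the illustrative sign-bit hash.
    proj = np.random.default_rng(0).standard_normal((d, n_hash_bits))
    for i, slot in enumerate(slot_ids):
        # In a fused kernel this concat lives in registers/shared memory.
        k = np.concatenate([k_nope[i], k_rope[i]])
        bits = (k @ proj) > 0                    # sign-bit hash of the key
        kv_cache[slot] = k                       # cache write
        hash_cache[slot] = np.packbits(bits)     # hash write, same pass
    return hash_cache, kv_cache
```

In the unfused version, the concat-cache kernel would first write the concatenated keys to HBM and the hash kernel would then read them back; fusing removes that read-write pair and one kernel launch.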

Test

Tested DeepSeek-R1-AWQ on 8x H100 GPUs; TPOT improves slightly.
