[Opt] Add fused triton operator for MLA models to gsa_on_device #894

Open
Fengli5355 wants to merge 1 commit into ModelEngine-Group:develop from Fengli5355:br_fused_op

Conversation

@Fengli5355
Contributor

Purpose

Fuse the hash and cache operators in the MLA sparse module.

Modifications

  1. Implemented a fused hash and concat-cache MLA kernel. Fusing the two operators avoids storing intermediate results to HBM, reducing GPU memory I/O and completing the work in a single kernel launch.
  2. Added a unit test for the MLA hash-and-cache fused operator.
  3. Renamed the unit test for the GQA hash-and-cache fused operator.
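The kernel source is not shown in this description, so here is a rough NumPy sketch of the fused semantics only. The function name, the sign-bit (SimHash-style) hash, and the cache layout are all illustrative assumptions, not the actual Triton implementation; the point is that the concatenated key never round-trips through a separate buffer between the hash step and the cache write.

```python
import numpy as np

def fused_hash_and_cache(k_nope, k_rope, hash_cache, kv_cache, slot_ids,
                         n_hash_bits=64):
    """Illustrative reference for a fused hash + concat-cache step (assumed
    semantics, not the PR's kernel). For each token: concatenate the two key
    parts, compute a sign-bit hash, and write both results to their caches
    in one pass, with no intermediate HBM buffer for the concatenated key."""
    d = k_nope.shape[1] + k_rope.shape[1]
    # Fixed random projection for the illustrative sign-bit hash.
    proj = np.random.default_rng(0).standard_normal((d, n_hash_bits))
    for i, slot in enumerate(slot_ids):
        # In a fused kernel this concat lives in registers/shared memory.
        k = np.concatenate([k_nope[i], k_rope[i]])
        bits = (k @ proj) > 0                    # sign-bit hash of the key
        kv_cache[slot] = k                       # cache write
        hash_cache[slot] = np.packbits(bits)     # hash write, same pass
    return hash_cache, kv_cache
```

In the unfused version, the concat-cache kernel would first write the concatenated keys to HBM and the hash kernel would then read them back; fusing removes that read-write pair and one kernel launch.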

Test

Tested DeepSeek-R1-AWQ on 8x H100 GPUs; TPOT improves slightly.
