Commit
Summary:
Hongtao identified the performance issue with the initial implementation and updated the assignments of tiles to each SM.

Performance with warp specialization

(Batch, Heads, SeqLen, Dhead)      triton_tutorial_flash_v2_tma_ws_persistent-tflops    triton_tutorial_flash_v2_tma_ws-tflops    triton_tutorial_flash_v2-tflops
-------------------------------  ---------------------------------------------------  ----------------------------------------  ---------------------------------
(8, 16, 8192, 128)                                                           516.164                                   490.451                            423.905

Pull Request resolved: #77

Reviewed By: xuzhao9, htyu

Differential Revision: D66463179

Pulled By: manman-ren

fbshipit-source-id: 14fecc1a1449828bfd82600bd161596349da3084
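
For a quick read of the numbers above, the relative gains can be computed directly from the reported TFLOPS. This is a minimal sketch, not part of the commit: the `tflops` dictionary and the `speedup` helper are hypothetical names introduced here for illustration, and the only data used is the single (8, 16, 8192, 128) row from the commit message.

```python
# TFLOPS copied from the commit message for shape (8, 16, 8192, 128).
# The dictionary and helper below are illustrative, not from the repository.
tflops = {
    "triton_tutorial_flash_v2_tma_ws_persistent": 516.164,
    "triton_tutorial_flash_v2_tma_ws": 490.451,
    "triton_tutorial_flash_v2": 423.905,
}


def speedup(variant: str, baseline: str) -> float:
    """Return the throughput ratio of `variant` over `baseline`."""
    return tflops[variant] / tflops[baseline]


if __name__ == "__main__":
    baseline = "triton_tutorial_flash_v2"
    for name in tflops:
        # Print each variant's throughput relative to the non-TMA baseline.
        print(f"{name}: {speedup(name, baseline):.3f}x vs {baseline}")
```

On this one shape, the persistent warp-specialized variant comes out at roughly 1.22x the `triton_tutorial_flash_v2` baseline and roughly 1.05x the non-persistent warp-specialized variant.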