Proposal to improve performance
Stage 1: Low Latency Performance Optimization
1. Quantization Recipe
We will focus on the performance optimization of the following quantization recipes (a usage sketch follows the list):
- FP8 with per-token-group scale activation and per-channel-group scale weight
- FP8 with per-token-group scale activation and block-scale weight
- MXFP4 with block-scale activation and block-scale weight
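As a usage reference only (not part of the proposal), the recipes above are normally exercised through vLLM's quantization options; the model name below is a placeholder and the `quantization="fp8"` path is one illustrative way to pick up an FP8 recipe. Pre-quantized FP8/MXFP4 checkpoints can instead be loaded directly, in which case the method is taken from the checkpoint config.

```python
# Minimal sketch: run a model with on-the-fly FP8 quantization in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="fp8",                        # dynamic FP8 weight/activation quantization
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```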
2. Parallelism Recipe
In the first stage, we will focus on optimizing the following two parallelism recipes, tensor parallelism (TP) and expert parallelism (EP), to reduce latency.
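For reference, a minimal sketch of launching the two recipes through the offline API; the model name and parallel sizes are illustrative, and `enable_expert_parallel` is assumed to be available as in recent upstream vLLM releases.

```python
from vllm import LLM

# TP8 recipe: shard all layers across 8 GPUs with tensor parallelism.
# For the TP8 + EP8 recipe, additionally enable expert parallelism so that
# MoE expert weights are distributed across ranks instead of sharded by TP.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder MoE model
    tensor_parallel_size=8,
    # enable_expert_parallel=True,     # uncomment for the TP8 + EP8 recipe
)
```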
3. Attention Backend
- Refactor the ROCm attention backend to avoid potential OOM issues: [ROCm][Perf] New design on ROCm AITER MHA backend Implementation #25763
- Integrate the performant MLA kernels (BF16/FP8) from AITER
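For context, the AITER attention paths are currently toggled through environment variables; the variable names below are assumptions based on the present ROCm code in vLLM and may change with the refactor.

```python
# Sketch: opt in to the ROCm AITER attention kernels before building the engine.
# VLLM_ROCM_USE_AITER gates AITER ops globally; the MHA/MLA variables select the
# AITER attention backends for dense-attention and MLA models respectively.
import os

os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_MHA"] = "1"   # assumed name, dense-attention models
os.environ["VLLM_ROCM_USE_AITER_MLA"] = "1"   # assumed name, MLA models (e.g. DeepSeek)

from vllm import LLM
llm = LLM(model="deepseek-ai/DeepSeek-V3", tensor_parallel_size=8)  # placeholder model
```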
4. MoE Module Optimization
(1) Support expert-related fusion patterns
- Fuse shared expert with routed expert: [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops #24097
- Fuse the small surrounding kernels into the shared-expert computation.
(2) Support EPLB for EP parallelism
- As of vLLM v0.11.0, the EPLB feature is not yet supported on the ROCm backend. We will add support soon to alleviate expert-load imbalance in EP scenarios.
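Once EPLB lands on ROCm, enabling it should follow the existing vLLM interface; a hedged sketch is shown below. The argument names follow current upstream vLLM (some are being folded into an eplb_config) and may change.

```python
from vllm import LLM

# Sketch: expert-parallel MoE serving with the expert-parallel load balancer (EPLB).
# Redundant experts give the balancer replicated "hot" experts that it can use to
# redistribute load across EP ranks.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder MoE model
    tensor_parallel_size=8,
    enable_expert_parallel=True,
    enable_eplb=True,
    num_redundant_experts=16,          # illustrative value
)
```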
5. Some Fusion Patterns
- [Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter #25693
- [ROCm][torch.compile] Adding ROCm-specific fusion pass for integrating aiter act/rms MXFP4 operators #25860
- [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass #26575
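These passes run as torch.compile custom passes; a hedged sketch of how such fusions are enabled through vLLM's compilation config follows. The field names mirror the upstream pass configuration and are assumptions for the ROCm path.

```python
from vllm import LLM

# Sketch: enable the custom fusion passes (e.g. RMSNorm/SiLU + FP8/MXFP4 quant fusion)
# through vLLM's torch.compile pass configuration.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder model
    compilation_config={
        "pass_config": {
            "enable_fusion": True,     # norm/activation + quant fusion passes
            "enable_noop": True,       # drop no-op reshapes so the patterns can match
        },
    },
)
```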
6. Collective Communication Optimization
Under the TP8+TP8 and TP8+EP8 recipes, we will optimize all-reduce kernel performance and implement the possible fusion patterns.
(1) All-reduce kernel performance optimization
- [ROCm][FEAT] Integrate AITER CustomAllreduce in cuda communicator. #23336
- [ROCm][Allreduce] Add dispatch mechanism for choosing performant allreduce implementations for AMD platforms #25618
(2) Support the all-reduce fusion patterns
We will support the official vLLM "pass_config" fusion option "enable_fi_allreduce_fusion" on the ROCm backend; a configuration sketch follows the pattern list. The supported fusion patterns include:
- AllReduceRMSNormPattern
- AllReduceFusedAddRMSNormPattern
- AllReduceFusedRMSNormStaticQuantFP8Pattern
- AllReduceFusedAddRMSNormStaticQuantFP8Pattern
- AllReduceFusedRMSNormStaticQuantNVFP4Pattern
- AllReduceFusedAddRMSNormStaticQuantNVFP4Pattern
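For context, a minimal sketch of turning on the option named above via the compilation config; the surrounding arguments are illustrative.

```python
from vllm import LLM

# Sketch: enable the all-reduce + RMSNorm (+ static quant) fusion patterns listed
# above through the "enable_fi_allreduce_fusion" option under "pass_config".
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder model
    tensor_parallel_size=8,
    compilation_config={
        "pass_config": {
            "enable_fi_allreduce_fusion": True,
        },
    },
)
```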
7. Sampling Module Optimization
- Integrate AITER TopK/TopP kernels [FEAT] [AITER] [ROCm] integrate aiter sampling ops #26084
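The AITER kernels target the top-k/top-p filtering step of the sampler; the request-level knobs that exercise that path are shown below with purely illustrative values.

```python
from vllm import LLM, SamplingParams

# Sketch: requests with top-k/top-p filtering, which is where the AITER sampling
# kernels are exercised on ROCm.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=50, max_tokens=64)
outputs = llm.generate(["Explain mixture-of-experts in one sentence."], params)
print(outputs[0].outputs[0].text)
```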
8. MTP Optimization
- We will optimize MTP=1 performance and integrate performant MHA and MLA kernels from AITER for the MTP=1 path.
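MTP here refers to multi-token prediction used as self-speculative decoding; a hedged sketch of enabling the MTP=1 path follows. The speculative-config keys mirror current upstream vLLM for DeepSeek-style MTP and are assumptions for the ROCm path.

```python
from vllm import LLM

# Sketch: MTP=1 speculative decoding, i.e. one extra speculative token per step
# proposed by the model's MTP head.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder MTP-capable model
    tensor_parallel_size=8,
    speculative_config={
        "method": "deepseek_mtp",      # assumed method name, as in upstream vLLM
        "num_speculative_tokens": 1,   # MTP=1
    },
)
```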
9. Long Context Optimization
We will support Context Parallelism for long-context scenarios and optimize its performance on the ROCm backend. The related community RFCs are tracked below; a configuration sketch follows the list.
- [Feature] Support Decode Context Parallel (DCP) for MLA #23734
- [RFC]: Support Context Parallelism with Fully Sharded KV Cache and Ring Attention #26133
- [RFC]: Support Prefill Context Parallel (PCP) #25749
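A rough sketch of the decode-context-parallel recipe tracked in #23734; the engine argument name is an assumption based on that PR and may differ once the ROCm support lands.

```python
from vllm import LLM

# Sketch: shard the KV cache of long sequences across ranks during decode
# (decode context parallelism, DCP) on top of tensor parallelism.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",       # placeholder MLA model
    tensor_parallel_size=8,
    decode_context_parallel_size=2,        # assumed argument name (see #23734)
    max_model_len=131072,                  # long-context workload
)
```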
10. Reduce Host Overhead
We have identified host overhead between consecutive decode steps; related issues have also been identified in the community.
Stage 2: High Throughput Performance Optimization
In the second stage, we will focus on performance optimization for distributed inference scenarios, using Disaggregated Prefilling and Data Parallelism to improve throughput.
1. Disaggregated Prefilling
We will support and optimize the Disaggregated Prefilling feature on the ROCm backend.
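Disaggregated prefill in vLLM is configured through a KV transfer connector; below is a hedged sketch of the prefill (producer) side. The connector name and role values are illustrative, and the config is shown as a plain dict for brevity, assuming recent vLLM versions accept it in that form.

```python
from vllm import LLM

# Sketch: prefill-side instance for P/D disaggregation. A matching decode-side
# instance would use kv_role="kv_consumer" and pull the KV cache produced here.
llm_prefill = LLM(
    model="deepseek-ai/DeepSeek-V3",       # placeholder model
    tensor_parallel_size=8,
    kv_transfer_config={
        "kv_connector": "NixlConnector",   # illustrative connector choice
        "kv_role": "kv_producer",
    },
)
```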
2. Parallelism Recipe
We will optimize distributed inference performance with Data Parallelism and large-scale Expert Parallelism.
(1) Support collective communication fusion patterns (see the sketch after the list)
- GEMMReduceScatterPattern
- AllGatherGEMMPattern
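The two patterns above correspond to vLLM's GEMM/collective-overlap compile passes; a hedged sketch of a data-parallel plus expert-parallel launch with those passes enabled follows. The pass-config flag names mirror the upstream pass configuration and are assumptions for the ROCm path.

```python
from vllm import LLM

# Sketch: data-parallel + expert-parallel deployment with the GEMM/collective
# fusion passes (GEMM + reduce-scatter, all-gather + GEMM) enabled at compile time.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder MoE model
    tensor_parallel_size=8,
    data_parallel_size=2,
    enable_expert_parallel=True,
    compilation_config={
        "pass_config": {
            "enable_sequence_parallelism": True,  # assumed prerequisite for the patterns below
            "enable_async_tp": True,              # GEMMReduceScatter / AllGatherGEMM fusion
        },
    },
)
```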