
[Performance]: Deepseek-V3 Performance Uplift Plan on ROCm Backend #26768

@wuhuikx

Description


Proposal to improve performance

Stage 1: Low Latency Performance Optimization

1. Quantization Recipe

We will focus on performance optimization of several quantization recipes.

  • FP8 with per-token-group scale activation and per-channel-group scale weight
  • FP8 with per-token-group scale activation and block-scale weight
  • MXFP4 with block-scale activation and block-scale weight
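As a rough illustration of the first recipe, per-token-group FP8 quantization computes one scale per contiguous group of values in each token row. The sketch below is a hedged pure-Python reference, not a ROCm kernel: the group size of 128 and the FP8 e4m3 maximum of 448 are common DeepSeek-V3 choices, and real kernels round to FP8 rather than merely clamping.

```python
# Reference semantics of per-token-group FP8 quantization (illustrative only).
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_per_token_group(activations, group_size=128):
    """For each token row, split the hidden dimension into groups of
    `group_size` and compute one scale per group: scale = max|x| / FP8_MAX."""
    quantized, scales = [], []
    for row in activations:
        q_row, s_row = [], []
        for i in range(0, len(row), group_size):
            group = row[i:i + group_size]
            amax = max(abs(x) for x in group) or 1e-6  # avoid divide-by-zero
            scale = amax / FP8_E4M3_MAX
            s_row.append(scale)
            # Simulate the FP8 cast by clamping; hardware kernels also round.
            q_row.extend(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
                         for x in group)
        quantized.append(q_row)
        scales.append(s_row)
    return quantized, scales
```

Dequantization multiplies each group back by its scale, which is why finer-grained (per-token-group) scales preserve more dynamic range than a single per-tensor scale.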

2. Parallelism Recipe

In the first stage, we will focus on optimizing two parallelism recipes, TP and EP, to reduce latency.
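For intuition on how the two recipes split the model differently, here is a hedged sketch (illustrative helper functions, not vLLM internals): TP gives every rank a column slice of each weight matrix, while EP gives each rank a subset of whole experts. The dimensions (hidden size 7168, 256 routed experts) match DeepSeek-V3 but are used only as examples.

```python
# Illustrative sharding helpers; not vLLM's actual partitioning code.

def tp_shard_columns(num_columns, tp_size, rank):
    """Tensor parallelism: every rank holds a column slice of each weight."""
    cols_per_rank = num_columns // tp_size
    return range(rank * cols_per_rank, (rank + 1) * cols_per_rank)

def ep_assign_experts(num_experts, ep_size, rank):
    """Expert parallelism: each rank holds a contiguous block of whole experts."""
    per_rank = num_experts // ep_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```

TP requires an all-reduce after each sharded layer, while EP replaces that with token routing (all-to-all) between ranks — which is why the two recipes stress different collectives.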

3. Attention Backend

4. MoE Module Optimization

(1) Support expert-related fusion patterns

(2) Support EPLB for EP parallelism

  • As of vLLM v0.11.0, the EPLB feature is not yet supported on the ROCm backend. We will add support soon to alleviate expert-load imbalance in EP scenarios.
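To show the kind of imbalance EPLB addresses, the following is a hypothetical greedy rebalancer (a sketch only — not vLLM's actual EPLB algorithm) that places the heaviest experts on the currently least-loaded ranks:

```python
import heapq

def eplb_greedy_placement(expert_loads, num_ranks):
    """Longest-processing-time greedy: assign the heaviest expert to the
    least-loaded rank. Returns a dict mapping rank -> list of expert ids.
    Illustrative sketch; real EPLB also considers replication and locality."""
    heap = [(0.0, r) for r in range(num_ranks)]  # (total load, rank)
    heapq.heapify(heap)
    placement = {r: [] for r in range(num_ranks)}
    for expert, load in sorted(enumerate(expert_loads), key=lambda kv: -kv[1]):
        total, r = heapq.heappop(heap)
        placement[r].append(expert)
        heapq.heappush(heap, (total + load, r))
    return placement
```

With skewed per-expert token counts, a static contiguous assignment can leave some ranks nearly idle; load-aware placement narrows the gap between the busiest and lightest rank.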

5. Some Fusion Patterns

6. Collective Communication Optimization

Under the TP8+TP8 and TP8+EP8 recipes, we will optimize all-reduce kernel performance and implement the applicable fusion patterns.

(1) All-reduce kernel performance optimization

(2) Support the all-reduce fusion patterns
We will support the official vLLM "enable_fi_allreduce_fusion" option under "pass_config" on the ROCm backend. The fusion patterns include:

  • AllReduceRMSNormPattern
  • AllReduceFusedAddRMSNormPattern
  • AllReduceFusedRMSNormStaticQuantFP8Pattern
  • AllReduceFusedAddRMSNormStaticQuantFP8Pattern
  • AllReduceFusedRMSNormStaticQuantNVFP4Pattern
  • AllReduceFusedAddRMSNormStaticQuantNVFP4Pattern
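For reference, here is the unfused computation that a pattern such as AllReduceFusedAddRMSNormPattern collapses into a single kernel launch. This is a plain-Python semantics sketch: the collective is stubbed out as a local sum over per-rank partial outputs, and shapes are illustrative.

```python
# Unfused reference for all-reduce + residual-add + RMSNorm (semantics only).

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: scale each element by the inverse root-mean-square of the row."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * w for v, w in zip(x, weight)]

def unfused(partial_outputs, residual, weight):
    # 1) all-reduce the TP partial sums (stubbed as a local elementwise sum)
    reduced = [sum(col) for col in zip(*partial_outputs)]
    # 2) residual add
    added = [r + h for r, h in zip(reduced, residual)]
    # 3) RMSNorm -- the fused pattern performs steps 1-3 in one kernel
    return rmsnorm(added, weight), added
```

Fusing these steps avoids two extra round trips through HBM for the intermediate tensors, which is where the latency win comes from; the StaticQuant variants additionally fold the FP8/NVFP4 output quantization into the same kernel.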

7. Sampling Module Optimization

8. MTP Optimization

  • We will optimize MTP=1 performance and integrate performant MHA and MLA kernels for MTP=1 from AITER.

9. Long Context Optimization

We will support Context Parallel for long-context scenarios and optimize its performance on the ROCm backend. Related community RFCs are tracked here.
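As a hedged illustration of the idea (not vLLM's scheduler), context parallelism shards one long prompt's token range across ranks so each rank computes attention over its own chunk:

```python
# Illustrative context-parallel chunking: split a long sequence into
# near-equal contiguous ranges, one per CP rank.

def cp_chunks(seq_len, cp_size):
    """Split [0, seq_len) into cp_size contiguous, near-equal chunks."""
    base, rem = divmod(seq_len, cp_size)
    chunks, start = [], 0
    for r in range(cp_size):
        size = base + (1 if r < rem else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks
```

Because causal attention for a chunk still needs keys/values from earlier chunks, real CP implementations also exchange KV blocks between ranks; the chunking above only shows how the sequence itself is divided.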

10. Reduce Host Overhead

We have identified host overhead between consecutive decode steps; related issues have also been reported in the community.

Stage 2: High Throughput Performance Optimization

In the second stage, we will focus on distributed-inference performance, using Disaggregated Prefilling and Data Parallelism to improve throughput.

1. Disaggregated Prefilling

We will support and optimize the Disaggregated Prefilling feature on the ROCm backend.

2. Parallelism Recipe

We will optimize distributed-inference performance with Data Parallelism and large-scale Expert Parallelism.

(1) Collective communication fusion pattern support

  • GEMMReduceScatterPattern
  • AllGatherGEMMPattern
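As a semantics sketch (not the fused kernel), GEMMReduceScatterPattern combines a per-rank GEMM with a reduce-scatter, so each rank ends up with only the summed rows of its own output shard instead of the full all-reduced matrix. The collective is stubbed out locally here; shapes and sharding are illustrative.

```python
# Unfused reference for GEMM + reduce-scatter (plain-Python stand-in).

def matmul(a, b):
    """Naive dense matmul for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gemm_reduce_scatter(a_shards, b_shards, num_ranks):
    # Each rank computes a GEMM over its local K-dimension shard...
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    # ...then the partial outputs are summed (the "reduce")...
    summed = [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
              for i in range(len(partials[0]))]
    # ...and rows are scattered back so each rank keeps only its shard.
    rows_per_rank = len(summed) // num_ranks
    return [summed[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(num_ranks)]
```

Fusing the GEMM with the reduce-scatter lets communication of early output tiles overlap with computation of later ones, instead of waiting for the full GEMM before the collective starts; AllGatherGEMMPattern is the mirror image on the input side.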

Labels

performance (Performance-related issues), rocm (Related to AMD ROCm)
