
[Performance]: Deepseek-V3 Performance Uplift Plan on ROCm Backend #26768

@wuhuikx

Description


Proposal to improve performance

Stage 1: Low Latency Performance Optimization

1. Quantization Recipe

We will focus on performance optimization of several quantization recipes.

  • FP8 with per-token-group scale activation and per-channel-group scale weight
  • FP8 with per-token-group scale activation and block-scale weight
  • MXFP4 with block-scale activation and block-scale weight
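As a rough illustration of the first recipe, per-token-group FP8 quantization computes one scale per contiguous group of values in each token row. The sketch below is a hedged pure-Python reference, not a ROCm kernel: the group size of 128 and the FP8 e4m3 maximum of 448 are common DeepSeek-V3 choices, and real kernels round to FP8 rather than merely clamping.

```python
# Reference semantics of per-token-group FP8 quantization (illustrative only).
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_per_token_group(activations, group_size=128):
    """For each token row, split the hidden dimension into groups of
    `group_size` and compute one scale per group: scale = max|x| / FP8_MAX."""
    quantized, scales = [], []
    for row in activations:
        q_row, s_row = [], []
        for i in range(0, len(row), group_size):
            group = row[i:i + group_size]
            amax = max(abs(x) for x in group) or 1e-6  # avoid divide-by-zero
            scale = amax / FP8_E4M3_MAX
            s_row.append(scale)
            # Simulate the FP8 cast by clamping; hardware kernels also round.
            q_row.extend(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
                         for x in group)
        quantized.append(q_row)
        scales.append(s_row)
    return quantized, scales
```

Dequantization multiplies each group back by its scale, which is why finer-grained (per-token-group) scales preserve more dynamic range than a single per-tensor scale.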

2. Parallelism Recipe

In the first stage, we will focus on optimizing two parallelism recipes, TP and EP, to reduce latency.
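For intuition on how the two recipes split the model differently, here is a hedged sketch (illustrative helper functions, not vLLM internals): TP gives every rank a column slice of each weight matrix, while EP gives each rank a subset of whole experts. The dimensions (hidden size 7168, 256 routed experts) match DeepSeek-V3 but are used only as examples.

```python
# Illustrative sharding helpers; not vLLM's actual partitioning code.

def tp_shard_columns(num_columns, tp_size, rank):
    """Tensor parallelism: every rank holds a column slice of each weight."""
    cols_per_rank = num_columns // tp_size
    return range(rank * cols_per_rank, (rank + 1) * cols_per_rank)

def ep_assign_experts(num_experts, ep_size, rank):
    """Expert parallelism: each rank holds a contiguous block of whole experts."""
    per_rank = num_experts // ep_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```

TP requires an all-reduce after each sharded layer, while EP replaces that with token routing (all-to-all) between ranks — which is why the two recipes stress different collectives.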

3. Attention Backend

4. MoE Module Optimization

(1) Support expert-related fusion patterns

(2) Support EPLB for EP parallelism

  • As of vLLM v0.11.0, the EPLB feature is not yet supported on the ROCm backend. We will add support soon to alleviate expert-load imbalance in EP scenarios.
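To show the kind of imbalance EPLB addresses, the following is a hypothetical greedy rebalancer (a sketch only — not vLLM's actual EPLB algorithm) that places the heaviest experts on the currently least-loaded ranks:

```python
import heapq

def eplb_greedy_placement(expert_loads, num_ranks):
    """Longest-processing-time greedy: assign the heaviest expert to the
    least-loaded rank. Returns a dict mapping rank -> list of expert ids.
    Illustrative sketch; real EPLB also considers replication and locality."""
    heap = [(0.0, r) for r in range(num_ranks)]  # (total load, rank)
    heapq.heapify(heap)
    placement = {r: [] for r in range(num_ranks)}
    for expert, load in sorted(enumerate(expert_loads), key=lambda kv: -kv[1]):
        total, r = heapq.heappop(heap)
        placement[r].append(expert)
        heapq.heappush(heap, (total + load, r))
    return placement
```

With skewed per-expert token counts, a static contiguous assignment can leave some ranks nearly idle; load-aware placement narrows the gap between the busiest and lightest rank.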

5. Some Fusion Patterns

6. Collective Communication Optimization

Under the TP8+TP8 and TP8+EP8 recipes, we will optimize all-reduce kernel performance and implement the applicable fusion patterns.

(1) All-reduce kernel performance optimization

(2) Support the all-reduce fusion patterns
We will support the official vLLM "enable_fi_allreduce_fusion" option under "pass_config" on the ROCm backend. The fusion patterns include:

  • AllReduceRMSNormPattern
  • AllReduceFusedAddRMSNormPattern
  • AllReduceFusedRMSNormStaticQuantFP8Pattern
  • AllReduceFusedAddRMSNormStaticQuantFP8Pattern
  • AllReduceFusedRMSNormStaticQuantNVFP4Pattern
  • AllReduceFusedAddRMSNormStaticQuantNVFP4Pattern
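For reference, here is the unfused computation that a pattern such as AllReduceFusedAddRMSNormPattern collapses into a single kernel launch. This is a plain-Python semantics sketch: the collective is stubbed out as a local sum over per-rank partial outputs, and shapes are illustrative.

```python
# Unfused reference for all-reduce + residual-add + RMSNorm (semantics only).

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: scale each element by the inverse root-mean-square of the row."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * w for v, w in zip(x, weight)]

def unfused(partial_outputs, residual, weight):
    # 1) all-reduce the TP partial sums (stubbed as a local elementwise sum)
    reduced = [sum(col) for col in zip(*partial_outputs)]
    # 2) residual add
    added = [r + h for r, h in zip(reduced, residual)]
    # 3) RMSNorm -- the fused pattern performs steps 1-3 in one kernel
    return rmsnorm(added, weight), added
```

Fusing these steps avoids two extra round trips through HBM for the intermediate tensors, which is where the latency win comes from; the StaticQuant variants additionally fold the FP8/NVFP4 output quantization into the same kernel.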

7. Sampling Module Optimization

8. MTP Optimization

  • We will optimize MTP=1 performance and integrate performant MHA and MLA kernels for MTP=1 from AITER.

9. Long Context Optimization

We will support Context Parallel for long-context scenarios and optimize its performance on the ROCm backend. Related community RFCs are tracked here.
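As a hedged illustration of the idea (not vLLM's scheduler), context parallelism shards one long prompt's token range across ranks so each rank computes attention over its own chunk:

```python
# Illustrative context-parallel chunking: split a long sequence into
# near-equal contiguous ranges, one per CP rank.

def cp_chunks(seq_len, cp_size):
    """Split [0, seq_len) into cp_size contiguous, near-equal chunks."""
    base, rem = divmod(seq_len, cp_size)
    chunks, start = [], 0
    for r in range(cp_size):
        size = base + (1 if r < rem else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks
```

Because causal attention for a chunk still needs keys/values from earlier chunks, real CP implementations also exchange KV blocks between ranks; the chunking above only shows how the sequence itself is divided.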

10. Reduce Host Overhead

We have identified host overhead between consecutive decode steps; related issues have also been reported in the community.

Stage 2: High Throughput Performance Optimization

In the second stage, we will focus on distributed-inference performance, using Disaggregated Prefilling and Data Parallelism to improve throughput.

1. Disaggregated Prefilling

We will support and optimize the Disaggregated Prefilling feature on the ROCm backend.

2. Parallelism Recipe

We will optimize distributed-inference performance with Data Parallelism and large-scale Expert Parallelism.

(1) Collective communication fusion pattern support

  • GEMMReduceScatterPattern
  • AllGatherGEMMPattern
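As a semantics sketch (not the fused kernel), GEMMReduceScatterPattern combines a per-rank GEMM with a reduce-scatter, so each rank ends up with only the summed rows of its own output shard instead of the full all-reduced matrix. The collective is stubbed out locally here; shapes and sharding are illustrative.

```python
# Unfused reference for GEMM + reduce-scatter (plain-Python stand-in).

def matmul(a, b):
    """Naive dense matmul for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gemm_reduce_scatter(a_shards, b_shards, num_ranks):
    # Each rank computes a GEMM over its local K-dimension shard...
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    # ...then the partial outputs are summed (the "reduce")...
    summed = [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
              for i in range(len(partials[0]))]
    # ...and rows are scattered back so each rank keeps only its shard.
    rows_per_rank = len(summed) // num_ranks
    return [summed[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(num_ranks)]
```

Fusing the GEMM with the reduce-scatter lets communication of early output tiles overlap with computation of later ones, instead of waiting for the full GEMM before the collective starts; AllGatherGEMMPattern is the mirror image on the input side.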

Labels

performance (Performance-related issues), rocm (Related to AMD ROCm)
