@zacliu2023

## Summary
- Add an FP8 quantization compatibility fix for the OOT (Out-of-Tree) platform
- Add a Triton-optimized kernel backend with 14 registered operators
- Add a consolidated benchmark suite comparing the SELECTIVE and FP8 strategies
- Enhance the dispatch mechanism with vendor-based kernel selection

## Key Changes

### FP8 Compatibility Fix (vllm_fl/platform.py)
- Add `_register_oot_quantization_kernels()` to patch vLLM's kernel mapping
- Map `PlatformEnum.OOT` to the CUDA kernels for FP8/INT8/mixed-precision methods
- Fix the `KeyError` raised when using `quantization='fp8'` with vLLM-FL (see the sketch below)
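
A minimal sketch of what such a patch could look like, assuming vLLM's `PlatformEnum` (which includes an `OOT` member) and a hypothetical `method_to_kernel_map` dict; the actual mapping patched in `vllm_fl/platform.py` may be named differently:

```python
# Hypothetical sketch of the OOT quantization patch; the mapping name
# `method_to_kernel_map` is illustrative, not vLLM's actual attribute.
from vllm.platforms import PlatformEnum


def _register_oot_quantization_kernels(method_to_kernel_map: dict) -> None:
    """Reuse the CUDA kernel entries for the OOT platform so that
    quantization='fp8' no longer raises KeyError under vLLM-FL."""
    for method in ("fp8", "int8", "mixed_precision"):
        per_platform = method_to_kernel_map.get(method)
        if per_platform is None:
            continue
        # Point OOT at the CUDA implementation when no OOT entry exists.
        if PlatformEnum.CUDA in per_platform and PlatformEnum.OOT not in per_platform:
            per_platform[PlatformEnum.OOT] = per_platform[PlatformEnum.CUDA]
```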

### Triton-Optimized Backend (vllm_fl/dispatch/backends/vendor/triton_optimized/)
- Register 14 Triton kernels with 2-40x speedups, including:
  - swap_blocks (40x), fused_residual_add_rmsnorm (2.7x)
  - merge_attn_states (5.9x), silu_and_mul (4x)
- Priority-based selection: 100 for fused ops, 95 for MLA, 90 for activation kernels (see the registry sketch below)
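
As an illustration only, a priority-based registry for vendor kernels could look like the following; `register_kernel` and `best_kernel` are hypothetical names, not the actual vllm_fl API:

```python
# Hypothetical priority registry; vllm_fl's real dispatch API may differ.
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class KernelEntry:
    priority: int
    fn: Callable = field(compare=False)


_REGISTRY: dict[str, list[KernelEntry]] = {}


def register_kernel(op_name: str, priority: int):
    """Decorator registering a kernel implementation under `op_name`."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op_name, []).append(KernelEntry(priority, fn))
        return fn
    return decorator


def best_kernel(op_name: str) -> Callable:
    """Return the highest-priority implementation for `op_name`."""
    return max(_REGISTRY[op_name]).fn


@register_kernel("silu_and_mul", priority=90)  # activation-class priority
def silu_and_mul_triton(x):
    ...
```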

### Dispatch Mechanism Enhancement
- Vendor whitelist/blacklist via `VLLM_FL_ALLOW_VENDORS` and `VLLM_FL_DENY_VENDORS`
- Fallback support with automatic retry on failure (see the dispatch sketch below)
- Debug logging via `VLLM_FL_DISPATCH_DEBUG=1`
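
A minimal sketch of how vendor filtering and fallback-with-retry could be wired together, under the assumption that candidates arrive sorted by priority; `dispatch` and `_vendor_allowed` are illustrative names:

```python
# Hypothetical dispatch loop; function and variable names are illustrative.
import logging
import os

logger = logging.getLogger("vllm_fl.dispatch")


def _vendor_allowed(vendor: str) -> bool:
    allow = {v for v in os.environ.get("VLLM_FL_ALLOW_VENDORS", "").split(",") if v}
    deny = {v for v in os.environ.get("VLLM_FL_DENY_VENDORS", "").split(",") if v}
    if vendor in deny:
        return False
    return not allow or vendor in allow


def dispatch(op_name: str, candidates, *args, **kwargs):
    """Try each (vendor, fn) candidate in priority order, falling back on failure."""
    debug = os.environ.get("VLLM_FL_DISPATCH_DEBUG") == "1"
    last_err = None
    for vendor, fn in candidates:
        if not _vendor_allowed(vendor):
            continue
        try:
            return fn(*args, **kwargs)
        except Exception as err:  # fall through to the next candidate
            last_err = err
            if debug:
                logger.info("%s: %s kernel failed (%s); retrying next backend",
                            op_name, vendor, err)
    raise RuntimeError(f"no usable kernel for {op_name}") from last_err
```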

### Benchmark Suite (benchmarks/kimi_k25/)
- `benchmark_e2e.py`: SELECTIVE vs. FP8 comparison (1k-4k input tokens, batch sizes 1-4; see the TPS sketch below)
- `profile_ops.py`: CUDA kernel profiler with per-category analysis
- `run_benchmark.sh`: unified entry point
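
For reference, throughput figures like the table below are conventionally computed as generated tokens divided by wall-clock time; a generic sketch (not the actual `benchmark_e2e.py` code, and `out.token_ids` is an assumed output shape) might look like:

```python
# Generic TPS measurement sketch, not the actual benchmark_e2e.py logic.
import time


def measure_tps(generate, prompts, max_new_tokens: int) -> float:
    """Run one generation pass and return tokens per second."""
    start = time.perf_counter()
    outputs = generate(prompts, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(out.token_ids) for out in outputs)
    return total_tokens / elapsed
```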

## Performance Results (A100-SXM4-40GB)
| Configuration | Avg TPS (tokens/s) | vs. BASELINE |
|---------------|--------------------|--------------|
| BASELINE      | 1806               | -            |
| SELECTIVE     | 1810               | +0.2%        |
| FP8           | 1885               | +4.4%        |

## Configuration
```bash
export VLLM_PLATFORM_PLUGIN=fl
export USE_FLAGGEMS=True
export GEMS_MODE=SELECTIVE
export VLLM_FL_PREFER=vendor
export VLLM_FL_ALLOW_VENDORS=triton_optimized,cuda
```
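
With these variables exported, the end-to-end comparison can presumably be launched through the unified entry point (exact script arguments, if any, are not shown in this PR):

```bash
# Hypothetical invocation; the script may accept additional arguments.
bash benchmarks/kimi_k25/run_benchmark.sh
```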
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@cyber-pioneer
Collaborator

Please replace all `print` calls with `logger.info`.
