@zacliu2023

## Summary
- Add an FP8 quantization compatibility fix for the OOT (Out-of-Tree) platform
- Add a Triton-optimized kernel backend with 14 registered operators
- Add a consolidated benchmark suite comparing the SELECTIVE and FP8 strategies
- Enhance the dispatch mechanism with vendor-based kernel selection

## Key Changes

### FP8 Compatibility Fix (vllm_fl/platform.py)
- Add `_register_oot_quantization_kernels()` to patch vLLM's kernel mapping
- Map `PlatformEnum.OOT` to the CUDA kernels for FP8/INT8/mixed-precision methods
- Fix the `KeyError` raised when using `quantization='fp8'` with vLLM-FL (see the sketch below)
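
A minimal sketch of what such a patch could look like, assuming vLLM's `PlatformEnum` (which includes an `OOT` member) and a hypothetical `method_to_kernel_map` dict; the actual mapping patched in `vllm_fl/platform.py` may be named differently:

```python
# Hypothetical sketch of the OOT quantization patch; the mapping name
# `method_to_kernel_map` is illustrative, not vLLM's actual attribute.
from vllm.platforms import PlatformEnum


def _register_oot_quantization_kernels(method_to_kernel_map: dict) -> None:
    """Reuse the CUDA kernel entries for the OOT platform so that
    quantization='fp8' no longer raises KeyError under vLLM-FL."""
    for method in ("fp8", "int8", "mixed_precision"):
        per_platform = method_to_kernel_map.get(method)
        if per_platform is None:
            continue
        # Point OOT at the CUDA implementation when no OOT entry exists.
        if PlatformEnum.CUDA in per_platform and PlatformEnum.OOT not in per_platform:
            per_platform[PlatformEnum.OOT] = per_platform[PlatformEnum.CUDA]
```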

### Triton-Optimized Backend (vllm_fl/dispatch/backends/vendor/triton_optimized/)
- Register 14 Triton kernels with 2-40x speedups, including:
  - swap_blocks (40x), fused_residual_add_rmsnorm (2.7x)
  - merge_attn_states (5.9x), silu_and_mul (4x)
- Priority-based selection: 100 for fused ops, 95 for MLA, 90 for activation kernels (see the registry sketch below)
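
As an illustration only, a priority-based registry for vendor kernels could look like the following; `register_kernel` and `best_kernel` are hypothetical names, not the actual vllm_fl API:

```python
# Hypothetical priority registry; vllm_fl's real dispatch API may differ.
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class KernelEntry:
    priority: int
    fn: Callable = field(compare=False)


_REGISTRY: dict[str, list[KernelEntry]] = {}


def register_kernel(op_name: str, priority: int):
    """Decorator registering a kernel implementation under `op_name`."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op_name, []).append(KernelEntry(priority, fn))
        return fn
    return decorator


def best_kernel(op_name: str) -> Callable:
    """Return the highest-priority implementation for `op_name`."""
    return max(_REGISTRY[op_name]).fn


@register_kernel("silu_and_mul", priority=90)  # activation-class priority
def silu_and_mul_triton(x):
    ...
```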

### Dispatch Mechanism Enhancement
- Vendor whitelist/blacklist via `VLLM_FL_ALLOW_VENDORS` and `VLLM_FL_DENY_VENDORS`
- Fallback support with automatic retry on failure (see the dispatch sketch below)
- Debug logging via `VLLM_FL_DISPATCH_DEBUG=1`
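
A minimal sketch of how vendor filtering and fallback-with-retry could be wired together, under the assumption that candidates arrive sorted by priority; `dispatch` and `_vendor_allowed` are illustrative names:

```python
# Hypothetical dispatch loop; function and variable names are illustrative.
import logging
import os

logger = logging.getLogger("vllm_fl.dispatch")


def _vendor_allowed(vendor: str) -> bool:
    allow = {v for v in os.environ.get("VLLM_FL_ALLOW_VENDORS", "").split(",") if v}
    deny = {v for v in os.environ.get("VLLM_FL_DENY_VENDORS", "").split(",") if v}
    if vendor in deny:
        return False
    return not allow or vendor in allow


def dispatch(op_name: str, candidates, *args, **kwargs):
    """Try each (vendor, fn) candidate in priority order, falling back on failure."""
    debug = os.environ.get("VLLM_FL_DISPATCH_DEBUG") == "1"
    last_err = None
    for vendor, fn in candidates:
        if not _vendor_allowed(vendor):
            continue
        try:
            return fn(*args, **kwargs)
        except Exception as err:  # fall through to the next candidate
            last_err = err
            if debug:
                logger.info("%s: %s kernel failed (%s); retrying next backend",
                            op_name, vendor, err)
    raise RuntimeError(f"no usable kernel for {op_name}") from last_err
```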

### Benchmark Suite (benchmarks/kimi_k25/)
- `benchmark_e2e.py`: SELECTIVE vs. FP8 comparison (1k-4k input tokens, batch sizes 1-4; see the TPS sketch below)
- `profile_ops.py`: CUDA kernel profiler with per-category analysis
- `run_benchmark.sh`: unified entry point
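
For reference, throughput figures like the table below are conventionally computed as generated tokens divided by wall-clock time; a generic sketch (not the actual `benchmark_e2e.py` code, and `out.token_ids` is an assumed output shape) might look like:

```python
# Generic TPS measurement sketch, not the actual benchmark_e2e.py logic.
import time


def measure_tps(generate, prompts, max_new_tokens: int) -> float:
    """Run one generation pass and return tokens per second."""
    start = time.perf_counter()
    outputs = generate(prompts, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(out.token_ids) for out in outputs)
    return total_tokens / elapsed
```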

## Performance Results (A100-SXM4-40GB)
| Configuration | Avg TPS (tokens/s) | vs. BASELINE |
|---------------|--------------------|--------------|
| BASELINE      | 1806               | -            |
| SELECTIVE     | 1810               | +0.2%        |
| FP8           | 1885               | +4.4%        |

## Configuration
```bash
export VLLM_PLATFORM_PLUGIN=fl
export USE_FLAGGEMS=True
export GEMS_MODE=SELECTIVE
export VLLM_FL_PREFER=vendor
export VLLM_FL_ALLOW_VENDORS=triton_optimized,cuda
```
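
With these variables exported, the end-to-end comparison can presumably be launched through the unified entry point (exact script arguments, if any, are not shown in this PR):

```bash
# Hypothetical invocation; the script may accept additional arguments.
bash benchmarks/kimi_k25/run_benchmark.sh
```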
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@cyber-pioneer
Collaborator

Please replace all `print` calls with `logger.info`.
