Pure Rust + CUDA LLM inference engine. No PyTorch. No model framework runtime.
@@ -18,7 +18,7 @@
---
-pegainfer is a from-scratch LLM inference engine written in **~9.6K lines of Rust**, **~2.6K lines of CUDA**, and **~1.4K lines of Triton GPU kernels**. No PyTorch, no ONNX, no model framework runtime — just Rust plus CUDA, Triton AOT, and generated compatibility kernels.
+openinfer is a from-scratch LLM inference engine written in **~9.6K lines of Rust**, **~2.6K lines of CUDA**, and **~1.4K lines of Triton GPU kernels**. No PyTorch, no ONNX, no model framework runtime — just Rust plus CUDA, Triton AOT, and generated compatibility kernels.
The goal is to understand every layer of the inference stack by building it from the ground up, and to explore what a Rust-native inference engine can look like.
@@ -63,11 +63,11 @@ huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
# Build & start server on port 8000
export CUDA_HOME=/usr/local/cuda
-export PEGAINFER_TRITON_PYTHON=.venv/bin/python
+export OPENINFER_TRITON_PYTHON=.venv/bin/python
cargo run --release
```
-> **Note**: The server CLI is in `pegainfer-server`. Model crates such as `pegainfer-qwen3-4b`, `pegainfer-qwen35-4b`, and `pegainfer-deepseek-v4` contain model logic and diagnostics but are not server entrypoints. Use `cargo run --release` from the workspace root, or `cargo run --release -p pegainfer-server -- --model-path `.
+> **Note**: The server CLI is in `openinfer-server`. Model crates such as `openinfer-qwen3-4b`, `openinfer-qwen35-4b`, and `openinfer-deepseek-v4` contain model logic and diagnostics but are not server entrypoints. Use `cargo run --release` from the workspace root, or `cargo run --release -p openinfer-server -- --model-path `.
```bash
# Try it
@@ -92,7 +92,7 @@ cargo run --release -- --model-path models/Qwen3.5-4B
# DeepSeek V4 Flash requires the feature-gated MP8 path and TileLang at build time
uv pip install "tilelang==0.1.9"
-export PEGAINFER_TILELANG_PYTHON=.venv/bin/python
+export OPENINFER_TILELANG_PYTHON=.venv/bin/python
cargo run --release --features deepseek-v4 -- --model-path models/DeepSeek-V4-Flash
# Disable CUDA Graph (useful for debugging)
@@ -104,9 +104,9 @@ cargo run --release -- --cuda-graph=false
| Variable | Description |
|----------|-------------|
| `CUDA_HOME` | CUDA Toolkit path (default: `/usr/local/cuda`) |
-| `PEGAINFER_TRITON_PYTHON` | Python with Triton for build-time AOT compilation |
-| `PEGAINFER_TILELANG_PYTHON` | Python with TileLang for `deepseek-v4` build-time kernel generation |
-| `PEGAINFER_CUDA_SM` | GPU SM target override when `nvidia-smi` unavailable (e.g. `120`) |
+| `OPENINFER_TRITON_PYTHON` | Python with Triton for build-time AOT compilation |
+| `OPENINFER_TILELANG_PYTHON` | Python with TileLang for `deepseek-v4` build-time kernel generation |
+| `OPENINFER_CUDA_SM` | GPU SM target override when `nvidia-smi` unavailable (e.g. `120`) |
@@ -117,10 +117,10 @@ cargo run --release -- --cuda-graph=false
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x"
uv venv .venv --python 3.12
uv pip install "triton-windows<3.7"
-$env:PEGAINFER_TRITON_PYTHON = ".venv\Scripts\python.exe"
+$env:OPENINFER_TRITON_PYTHON = ".venv\Scripts\python.exe"
cargo build --release
-cargo run --release -p pegainfer-server -- --model-path models/Qwen3-4B
+cargo run --release -p openinfer-server -- --model-path models/Qwen3-4B
```
@@ -136,7 +136,7 @@ cargo run --release -p pegainfer-server -- --model-path models/Qwen3-4B
| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | MoE + sparse attention, MP8 checkpoint | 671B total / 37B active | Initial greedy, feature-gated, 8-GPU MP8 |
| [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) | MLA + MoE + Marlin INT4 | 1T total / 32B active | Feature-gated, `--features kimi-k2`, 8-GPU EP path |
-Model type is auto-detected from `config.json` — just point `--model-path` at any supported model directory. Feature-gated model lines require rebuilding `pegainfer-server` with the matching `--features ...` flag before launch.
+Model type is auto-detected from `config.json` — just point `--model-path` at any supported model directory. Feature-gated model lines require rebuilding `openinfer-server` with the matching `--features ...` flag before launch.
DeepSeek V4 support is intentionally narrower than the Qwen paths in the initial PR: it requires `--features deepseek-v4`, uses CUDA devices `0..7`, serves greedy requests only, terminates unsupported logprobs and non-greedy sampling requests with an explicit `stop_reason`, and does not use CUDA Graph yet.
@@ -162,12 +162,12 @@ HTTP / vLLM frontend → EngineHandle → per-model engine crate
│
┌───────────────────────┼───────────────────────┐
│ │ │
-pegainfer-qwen3-4b pegainfer-qwen35-4b pegainfer-deepseek-v4
+openinfer-qwen3-4b openinfer-qwen35-4b openinfer-deepseek-v4
(full attention) (24 linear + 8 full) (MP8 MoE + sparse attn)
│ │ │
└───────────────────────┼───────────────────────┘
│
- pegainfer-core runtime + pegainfer-kernels
+ openinfer-core runtime + openinfer-kernels
│
CUDA / cuBLAS / Triton / TileLang / FlashInfer
```
@@ -179,7 +179,7 @@ pegainfer-qwen3-4b pegainfer-qwen35-4b pegainfer-deepseek-v4
- **Fused operators where mature** — Qwen decode paths use fused attention/MLP kernels; DeepSeek V4 is currently a multi-stage MP8 path with TileLang kernels, NCCL reductions, and CUDA glue
- **BF16 storage, FP32 accumulation** — numerical stability without memory overhead
- **CUDA Graph** on Qwen decode paths — eliminates kernel launch overhead where enabled
-- **Per-model crate boundary** — Qwen3-4B owns its config, weights, scheduler/executor, tests, benches, and kernel plan in `pegainfer-qwen3-4b`
+- **Per-model crate boundary** — Qwen3-4B owns its config, weights, scheduler/executor, tests, benches, and kernel plan in `openinfer-qwen3-4b`
**Model details:**
@@ -200,17 +200,17 @@ pegainfer-qwen3-4b pegainfer-qwen35-4b pegainfer-deepseek-v4
cargo test --release --workspace --lib
# Accuracy and integration tests (need GPU + model weights)
-PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test hf_golden_gate
-PEGAINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p pegainfer-qwen35-4b --test hf_golden_gate
-PEGAINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p pegainfer-qwen35-4b --test e2e_scheduler
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test e2e
+OPENINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p openinfer-qwen3-4b --test hf_golden_gate
+OPENINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p openinfer-qwen35-4b --test hf_golden_gate
+OPENINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p openinfer-qwen35-4b --test e2e_scheduler
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test e2e
```
### Triton AOT
Triton compiles the Qwen3.5 compatibility AOT kernels at build time. Qwen3-4B dense full-attention kernels are CUDA/cuBLAS/FlashInfer C++ wrappers. Runtime has no Python dependency — Triton is build-time only.
-See `pegainfer-kernels/tools/triton/README.md` for setup and troubleshooting.
+See `openinfer-kernels/tools/triton/README.md` for setup and troubleshooting.
### Source Layout
@@ -220,36 +220,36 @@ See `pegainfer-kernels/tools/triton/README.md` for setup and troubleshooting.
```
Cargo.toml # Virtual workspace root
-pegainfer-server/ # Product package: CLI, vLLM frontend, benchmarks
+openinfer-server/ # Product package: CLI, vLLM frontend, benchmarks
├── src/main.rs # CLI + vLLM/OpenAI server startup
├── src/vllm_frontend.rs # vLLM engine-core bridge into a generic EngineHandle
├── src/server_engine.rs # Model detection and shared server helpers
├── src/scheduler.rs # Compatibility re-export of core engine request/event types
├── src/ops.rs # Compatibility re-export of shared GPU ops
├── src/ops/tests.rs # Server package operator coverage tests
-├── src/tensor.rs # Re-export of pegainfer-kernels tensor types
+├── src/tensor.rs # Re-export of openinfer-kernels tensor types
├── src/sampler.rs # Temperature, top-k, top-p sampling
└── src/logging.rs # Runtime logging setup
-pegainfer-core/ # Shared runtime API for model crates
+openinfer-core/ # Shared runtime API for model crates
├── src/engine.rs # EngineHandle, GenerateRequest, TokenEvent
├── src/kv_pool.rs # Paged KV pool and request state
-├── src/ops.rs # Shared op wrappers over pegainfer-kernels
+├── src/ops.rs # Shared op wrappers over openinfer-kernels
└── src/weight_loader.rs # Safetensors helpers shared by model crates
-pegainfer-kernels/ # Shared GPU kernel/runtime crate
+openinfer-kernels/ # Shared GPU kernel/runtime crate
├── KERNELS.md # LLM routing index for model op -> wrapper -> FFI -> source
├── src/ # GPU tensor types, FFI, paged KV layout, Rust ops
├── csrc/ # Hand-written CUDA / FlashInfer C++ wrappers
└── tools/triton/ # Triton AOT kernels (build-time compiled)
-pegainfer-qwen3-4b/ # Qwen3-4B model-owned engine crate
+openinfer-qwen3-4b/ # Qwen3-4B model-owned engine crate
├── src/ # Config, weights, prefill/decode/unified, scheduler/executor
├── tests/ # Qwen3 HF logits gate and integration coverage
├── benches/ # Qwen3 model-level benchmarks
└── src/kernel_plan.rs # Model DAG phase -> kernel routing index
-pegainfer-qwen35-4b/ # Qwen3.5-4B model-owned engine crate
+openinfer-qwen35-4b/ # Qwen3.5-4B model-owned engine crate
├── src/ # Config, weights, prefill/decode/unified, recurrent state, scheduler
├── tests/ # Qwen3.5 HF logits gate and scheduler integration
└── benches/ # Qwen3.5 recurrent/norm operator benchmarks
diff --git a/docs/benchmarks/accuracy-eval-results.md b/docs/benchmarks/accuracy-eval-results.md
index 35f0086e..b628dc1f 100644
--- a/docs/benchmarks/accuracy-eval-results.md
+++ b/docs/benchmarks/accuracy-eval-results.md
@@ -5,31 +5,31 @@
| Model | Backend | GSM8K 8-shot (strict-match) | GSM8K 8-shot (flexible-extract) | Delta vs HF | Status |
|-------|---------|----------------------------:|--------------------------------:|:-----------:|:------:|
| Qwen3-4B | HF transformers | 85.82% | 85.82% | — | baseline |
-| Qwen3-4B | pegainfer | 85.37% | 85.44% | -0.45 pp / -0.38 pp | PASS |
+| Qwen3-4B | openinfer | 85.37% | 85.44% | -0.45 pp / -0.38 pp | PASS |
| Qwen3.5-4B | HF transformers | 79.45% | 79.45% | — | baseline |
-| Qwen3.5-4B | pegainfer (historical) | 1.97% | 10.61% | -77.48 pp / -68.84 pp | FAIL |
-| Qwen3.5-4B | pegainfer (#250 RoPE cache fix) | 79.38% | 79.30% | -0.07 pp / -0.15 pp | PASS |
+| Qwen3.5-4B | openinfer (historical) | 1.97% | 10.61% | -77.48 pp / -68.84 pp | FAIL |
+| Qwen3.5-4B | openinfer (#250 RoPE cache fix) | 79.38% | 79.30% | -0.07 pp / -0.15 pp | PASS |
**Pass criteria:** absolute delta < 1 percentage point.
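The pass criterion can be checked mechanically against the summary table. The following is a minimal sketch rather than part of the eval tooling: the names `delta_pp`, `status`, and the dictionary layout are assumptions made for illustration, and the scores are copied from the table above.

```python
# Hypothetical helper: names and layout are illustrative, not from the repo.
HF_BASELINE = {"Qwen3-4B": 85.82, "Qwen3.5-4B": 79.45}

# (strict-match, flexible-extract) scores in percent, copied from the table.
RUNS = {
    ("Qwen3-4B", "openinfer"): (85.37, 85.44),
    ("Qwen3.5-4B", "historical"): (1.97, 10.61),
    ("Qwen3.5-4B", "#250 RoPE cache fix"): (79.38, 79.30),
}

THRESHOLD_PP = 1.0  # pass if |delta| < 1 percentage point


def delta_pp(score, baseline):
    """Signed difference in percentage points, rounded like the table."""
    return round(score - baseline, 2)


def status(model, scores):
    """Return the per-filter deltas and PASS only if every delta is in range."""
    deltas = [delta_pp(s, HF_BASELINE[model]) for s in scores]
    ok = all(abs(d) < THRESHOLD_PP for d in deltas)
    return deltas, "PASS" if ok else "FAIL"


for (model, label), scores in RUNS.items():
    deltas, verdict = status(model, scores)
    print(f"{model} ({label}): {deltas} pp -> {verdict}")
```

Requiring both filters to stay inside the threshold is an assumption of this sketch; the table itself only records a single status per row.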
## Qwen3-4B: PASS
-Pegainfer and HF transformers produce near-identical results. The 0.45% delta is well within the 1% threshold and consistent with expected bf16 tie-sensitive rounding differences (2/13 token-level mismatches observed in prior token-level validation).
+Openinfer and HF transformers produce near-identical results. The 0.45 pp delta is well within the 1 pp threshold and consistent with expected bf16 tie-sensitive rounding differences (2/13 token-level mismatches observed in prior token-level validation).
## Qwen3.5-4B: Historical FAIL — Recovered By #250
### Symptoms
-Before #250, pegainfer scored 10.61% (flexible) vs HF's 79.45% on GSM8K 8-shot.
+Before #250, openinfer scored 10.61% (flexible) vs HF's 79.45% on GSM8K 8-shot.
### Root Cause
-Before #250, Qwen3.5-4B produced divergent outputs in pegainfer vs HF transformers when processing long prompts (8-shot few-shot prefix, ~1771 input tokens):
+Before #250, Qwen3.5-4B produced divergent outputs in openinfer vs HF transformers when processing long prompts (8-shot few-shot prefix, ~1771 input tokens):
-- **0-shot (41 tokens):** pegainfer and HF output match — both generate `\n\n` followed by a correct answer.
+- **0-shot (41 tokens):** openinfer and HF output match — both generate `\n\n` followed by a correct answer.
- **8-shot (1771 tokens):** outputs diverge completely.
- HF: ` Natalia sold 48 / 2 = <<48/2=24>>24` (correct format, correct answer)
- - pegainfer: ` 168\n\nQuestion: Question: Question:...` (wrong number, degenerate repetition)
+ - openinfer: ` 168\n\nQuestion: Question: Question:...` (wrong number, degenerate repetition)
The first generated token already differed, indicating the prefill logits diverged for long sequences. This did not affect Qwen3-4B (which uses a standard transformer architecture), only Qwen3.5-4B (which uses a hybrid linear-attention plus full-attention architecture with different prefill kernels).
@@ -43,7 +43,7 @@ adds fail-closed cache coverage checks, and adds a long HF logits golden over
```bash
export MODEL_PATH=/path/to/Qwen3.5-4B
export LM_EVAL_BIN=/path/to/lm_eval
-export RESULT_ROOT=results/qwen35-gsm8k-8shot-pegainfer-issue250
+export RESULT_ROOT=results/qwen35-gsm8k-8shot-openinfer-issue250
$LM_EVAL_BIN run \
--model local-completions \
@@ -55,7 +55,7 @@ $LM_EVAL_BIN run \
```
Result file:
-`results/qwen35-gsm8k-8shot-pegainfer-issue250/qwen35-eval/results_*.json`
+`results/qwen35-gsm8k-8shot-openinfer-issue250/qwen35-eval/results_*.json`
| Filter | exact_match | stderr | Delta vs HF 79.45% |
| --- | ---: | ---: | ---: |
@@ -76,7 +76,7 @@ lm-eval: 0.4.11
transformers: 5.4.0
torch: 2.11.0+cu128
GPU: NVIDIA GeForce RTX 5070 Ti (16GB)
-pegainfer: commit 280e457 (main)
+openinfer: commit 280e457 (main)
```
### #250 Recovery Environment
@@ -89,7 +89,7 @@ dataset: cached openai/gsm8k snapshot
GPU: NVIDIA GeForce RTX 5090 (sm_120)
CUDA: 12.8
Triton AOT Python: Triton 3.4.0 environment
-pegainfer: issue #250 RoPE-cache fix branch
+openinfer: issue #250 RoPE-cache fix branch
```
### HF Baselines
@@ -110,18 +110,18 @@ pegainfer: issue #250 RoPE-cache fix branch
--output_path results/hf-qwen35-4b
```
-### Pegainfer Eval
+### Openinfer Eval
```bash
# Start server (one model at a time, single GPU)
-PEGAINFER_TRITON_PYTHON=.venv/bin/python \
+OPENINFER_TRITON_PYTHON=.venv/bin/python \
cargo run --release -- --model-path models/Qwen3-4B --port 8000 --cuda-graph=false
# Run eval (separate terminal, from repo root)
.venv/bin/lm_eval run --model local-completions \
--model_args "model=Qwen3-4B,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=models/Qwen3-4B,tokenized_requests=False" \
--tasks gsm8k --num_fewshot 8 --batch_size 1 \
- --output_path results/pegainfer-qwen3-4b
+ --output_path results/openinfer-qwen3-4b
```
**Note:** `local-completions` requires `tokenized_requests=False` and `base_url` pointing to the full `/v1/completions` endpoint.
@@ -132,5 +132,5 @@ PEGAINFER_TRITON_PYTHON=.venv/bin/python \
|-----|----------|
| HF Qwen3-4B | ~1h43m |
| HF Qwen3.5-4B | ~2h11m |
-| pegainfer Qwen3-4B | ~1h20m |
-| pegainfer Qwen3.5-4B | ~1h16m |
+| openinfer Qwen3-4B | ~1h20m |
+| openinfer Qwen3.5-4B | ~1h16m |
diff --git a/docs/benchmarks/bs1-4k64-vllm-pegainfer.md b/docs/benchmarks/bs1-4k64-vllm-openinfer.md
similarity index 82%
rename from docs/benchmarks/bs1-4k64-vllm-pegainfer.md
rename to docs/benchmarks/bs1-4k64-vllm-openinfer.md
index 1d024abd..4c78b607 100644
--- a/docs/benchmarks/bs1-4k64-vllm-pegainfer.md
+++ b/docs/benchmarks/bs1-4k64-vllm-openinfer.md
@@ -1,8 +1,8 @@
-# bs1 4k/64 vLLM vs PegaInfer
+# bs1 4k/64 vLLM vs OpenInfer
**Created**: 2026-05-04
**Status**: complete
-**TL;DR**: On RTX 5090, `bs=1`, `input_len=4096`, `output_len=64`, `num_prompts=20`, `max_concurrency=1`, no vLLM prefix cache: PegaInfer finished `5.7%` faster wall-clock, with TTFT median `177.1ms` vs vLLM `197.8ms`. Decode TPOT was slightly slower: `6.47ms` vs vLLM `6.36ms`. PegaInfer's streaming `usage.completion_tokens` is overreported through the vLLM frontend in this run, so output throughput should be recomputed from the fixed target length.
+**TL;DR**: On RTX 5090, `bs=1`, `input_len=4096`, `output_len=64`, `num_prompts=20`, `max_concurrency=1`, no vLLM prefix cache: OpenInfer finished `5.7%` faster wall-clock, with TTFT median `177.1ms` vs vLLM `197.8ms`. Decode TPOT was slightly slower: `6.47ms` vs vLLM `6.36ms`. OpenInfer's streaming `usage.completion_tokens` is overreported through the vLLM frontend in this run, so output throughput should be recomputed from the fixed target length.
## Preparation
@@ -14,12 +14,12 @@
- `docs/subsystems/scheduler/scheduler.md` showed fixed-length single-concurrency results should be interpreted as latency probes rather than full serving saturation claims.
- **Plan**:
1. Use `vllm` as the client and vLLM server.
- 2. Use the release `pegainfer` binary for the PegaInfer server.
+ 2. Use the release `openinfer` binary for the OpenInfer server.
3. Run `input_len=4096`, `output_len=64`, `num_prompts=20`, `max_concurrency=1`, `request_rate=inf`, after a 3-request warmup for each engine.
4. Save JSON/log artifacts under a timestamped result directory and compare TTFT/TPOT/throughput.
- **Risks / open questions**:
- vLLM prefix caching must be disabled for a fair random-prompt prefill comparison.
- - PegaInfer's vLLM frontend may not report streaming usage with the exact same accounting as vLLM.
+ - OpenInfer's vLLM frontend may not report streaming usage with the exact same accounting as vLLM.
## Execution Log
@@ -29,9 +29,9 @@
- `uv pip list --python /bin/python` showed `vllm 0.19.1`, `torch 2.10.0`, `flashinfer-python 0.6.6`, and `flashinfer-cubin 0.6.6`.
- Confirmed model path:
- ``, size `7.6G`.
-- Built PegaInfer server binary in the validation worktree:
- - `PEGAINFER_CUDA_SM=120 PEGAINFER_TRITON_PYTHON=/bin/python cargo build --release -p pegainfer --bin pegainfer`
- - The validation shell session hung after the build process ended, but `target/release/pegainfer` existed with timestamp `2026-05-04 21:11`.
+- Built OpenInfer server binary in the validation worktree:
+ - `OPENINFER_CUDA_SM=120 OPENINFER_TRITON_PYTHON=/bin/python cargo build --release -p openinfer --bin openinfer`
+ - The validation shell session hung after the build process ended, but `target/release/openinfer` existed with timestamp `2026-05-04 21:11`.
### Step 2: vLLM run
- First vLLM run used default prefix-cache behavior and showed prefix cache hits in the server log, so it was not used for the final comparison.
@@ -52,12 +52,12 @@
- TPOT median `6.359ms`, p99 `6.366ms`.
- ITL median `6.389ms`, p99 `6.638ms`.
-### Step 3: PegaInfer run
-- PegaInfer served model ID was ``, not `Qwen3-4B`, so the client model name was set to ``.
+### Step 3: OpenInfer run
+- OpenInfer served model ID was ``, not `Qwen3-4B`, so the client model name was set to ``.
- Measured JSON:
- - `pegainfer-in4096-out64-c1-n20.json`
+ - `openinfer-in4096-out64-c1-n20.json`
- Command shape:
- - server: `target/release/pegainfer --model-path --port 8000`
+ - server: `target/release/openinfer --model-path --port 8000`
- client: same `vllm bench serve` shape as vLLM, except `--model `.
- Raw results:
- completed `20`, failed `0`.
@@ -68,10 +68,10 @@
- ITL median `6.464ms`, p99 `6.546ms`.
- Accounting caveat:
- JSON `total_output_tokens` was `5312`, but the fixed workload was `20 * 64 = 1280` output tokens and the timing matches 64 generated tokens per request.
- - For this run, PegaInfer output throughput should be recomputed as `1280 / 11.287s = 113.401 tok/s`, not the raw JSON `output_throughput`.
+ - For this run, OpenInfer output throughput should be recomputed as `1280 / 11.287s = 113.401 tok/s`, not the raw JSON `output_throughput`.
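The accounting correction above is plain arithmetic and can be reproduced as follows. This is a minimal sketch with illustrative variable names; the inputs are the figures quoted in the run notes, and the wall duration is the rounded `11.287s` value, so the result agrees with the quoted `113.401 tok/s` only to about two decimal places.

```python
# Values copied from the run notes; variable names are illustrative.
NUM_PROMPTS = 20
OUTPUT_LEN = 64                  # fixed target length per request
RAW_TOTAL_OUTPUT_TOKENS = 5312   # JSON total_output_tokens (overreported)
WALL_S = 11.287                  # rounded wall duration in seconds

# Tokens the fixed workload actually asked for.
expected_tokens = NUM_PROMPTS * OUTPUT_LEN

# How far the streaming usage field overshoots the fixed workload.
overreport_factor = RAW_TOTAL_OUTPUT_TOKENS / expected_tokens

# Throughput recomputed from the requested output length.
corrected_tok_per_s = expected_tokens / WALL_S

print(f"expected tokens:   {expected_tokens}")
print(f"overreport factor: {overreport_factor:.2f}x")
print(f"corrected output throughput: {corrected_tok_per_s:.2f} tok/s")
```

Recomputing from the requested length is only valid because every request in this shape generates exactly `output_len` tokens; for variable-length outputs the token count would have to come from another source.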
### Step 4: Comparison
-- PegaInfer vs vLLM no-prefix:
+- OpenInfer vs vLLM no-prefix:
- Wall duration: `11.287s` vs `11.968s` (`5.7%` faster).
- Request throughput: `1.772` vs `1.671 req/s` (`6.0%` higher).
- Corrected output throughput: `113.401` vs `106.952 tok/s` (`6.0%` higher).
@@ -83,12 +83,12 @@
## Debrief
-- **Outcome**: Completed the `bs=1`, `4k input`, `64 output` single-concurrency probe on RTX 5090. PegaInfer has better prefill/TTFT and slightly slower decode TPOT; wall-clock request throughput is higher because TTFT dominates this shape.
+- **Outcome**: Completed the `bs=1`, `4k input`, `64 output` single-concurrency probe on RTX 5090. OpenInfer has better prefill/TTFT and slightly slower decode TPOT; wall-clock request throughput is higher because TTFT dominates this shape.
- **Pitfalls encountered**:
- The first vLLM measurement had prefix cache hits. It was rerun with `--no-enable-prefix-caching`.
- The validation shell session can remain open after some long build/server scripts even when the validation work has finished; checking validation process state is necessary before assuming a command is still running.
- - PegaInfer's vLLM frontend overreported streaming `completion_tokens` for this benchmark, so the raw output throughput field in `vllm bench` JSON is not reliable for PegaInfer here.
+ - OpenInfer's vLLM frontend overreported streaming `completion_tokens` for this benchmark, so the raw output throughput field in `vllm bench` JSON is not reliable for OpenInfer here.
- **Lessons learned**:
- - For fixed-output PegaInfer comparisons through `vllm bench serve`, trust TTFT/TPOT/ITL and recompute output throughput from requested output length until streaming usage accounting is fixed.
+ - For fixed-output OpenInfer comparisons through `vllm bench serve`, trust TTFT/TPOT/ITL and recompute output throughput from requested output length until streaming usage accounting is fixed.
- Disable vLLM prefix caching for random synthetic prefill probes unless prefix-cache behavior is explicitly part of the experiment.
  - At this shape, the new Qwen3 prefill q64 path shows up as a TTFT advantage against vLLM, while decode remains essentially at parity.
diff --git a/docs/benchmarks/deepep-v2-vs-pplx-moe-backend.md b/docs/benchmarks/deepep-v2-vs-pplx-moe-backend.md
index 4f5d3331..930853a1 100644
--- a/docs/benchmarks/deepep-v2-vs-pplx-moe-backend.md
+++ b/docs/benchmarks/deepep-v2-vs-pplx-moe-backend.md
@@ -1,6 +1,6 @@
-# DeepEP V2 vs PegaInfer PPLX EP on H20 x8
+# DeepEP V2 vs OpenInfer PPLX EP on H20 x8
-> **TL;DR** On an 8x H20 node, DeepEP V2 ElasticBuffer/NCCL Gin is clearly ahead of the current PegaInfer PPLX EP microbenchmark on the tested MoE exchange shapes. In the paired run here, the directional dispatch+combine ratio is about 2.5x to 5.3x; against the earlier PPLX snapshot, it is about 2.4x to 4.5x. This is a backend direction check, not a dtype-identical replacement gate.
+> **TL;DR** On an 8x H20 node, DeepEP V2 ElasticBuffer/NCCL Gin is clearly ahead of the current OpenInfer PPLX EP microbenchmark on the tested MoE exchange shapes. In the paired run here, the directional dispatch+combine ratio is about 2.5x to 5.3x; against the earlier PPLX snapshot, it is about 2.4x to 4.5x. This is a backend direction check, not a dtype-identical replacement gate.
Last touched: 2026-05-25
@@ -8,8 +8,8 @@ Last touched: 2026-05-25
| Component | Revision |
| --- | --- |
-| PegaInfer paired run | `f071baa` |
-| PegaInfer historical PPLX snapshot | `ec514ef` |
+| OpenInfer paired run | `f071baa` |
+| OpenInfer historical PPLX snapshot | `ec514ef` |
| DeepEP | `723716f` |
## Hardware And Software
@@ -23,10 +23,10 @@ Last touched: 2026-05-25
## Method
-PegaInfer paired-run command:
+OpenInfer paired-run command:
```bash
-cargo run -r -p pegainfer-comm --bin pplx_a2a_bench -- --sweep --warmup 20 --repeats 100
+cargo run -r -p openinfer-comm --bin pplx_a2a_bench -- --sweep --warmup 20 --repeats 100
```
DeepEP V2 command template:
@@ -51,11 +51,11 @@ Sweep inputs:
DeepEP was run with `--test-first-only`, so the measured case is the first elastic EP case: copy enabled, expert alignment 128, FP8 dispatch enabled, BF16 combine, no previous event, synchronous path. Correctness checks were skipped with `--skip-check`; this run is latency-only.
-PegaInfer reports event-timed `max_rank_split_sum_us` for the full dispatch_send -> dispatch_recv -> combine_send -> combine_recv cycle. DeepEP reports profiler averages for ordinary dispatch and ordinary combine. For comparison, this note takes the ordinary dispatch line and ordinary combine line, sums dispatch+combine by rank, and reports both the worst rank and the mean rank. Because these are not identical timing harnesses, all ratios below are directional.
+OpenInfer reports event-timed `max_rank_split_sum_us` for the full dispatch_send -> dispatch_recv -> combine_send -> combine_recv cycle. DeepEP reports profiler averages for ordinary dispatch and ordinary combine. For comparison, this note takes the ordinary dispatch line and ordinary combine line, sums dispatch+combine by rank, and reports both the worst rank and the mean rank. Because these are not identical timing harnesses, all ratios below are directional.
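The DeepEP-side reduction described above (sum dispatch and combine per rank, then take the worst and the mean rank) can be sketched as below. The function names and the per-rank sample values are made up for illustration; only the `87.5` and `23.815` figures in the ratio example come from the results table, and using the worst-rank sum as the ratio denominator is an assumption of this sketch.

```python
from statistics import mean


def rank_sums(dispatch_us, combine_us):
    """Per-rank dispatch+combine totals in microseconds."""
    if len(dispatch_us) != len(combine_us):
        raise ValueError("need one dispatch and one combine value per rank")
    return [d + c for d, c in zip(dispatch_us, combine_us)]


def summarize(dispatch_us, combine_us):
    """Return (worst-rank sum, mean-rank sum)."""
    sums = rank_sums(dispatch_us, combine_us)
    return max(sums), mean(sums)


def directional_ratio(openinfer_p50_us, deepep_sum_us):
    """How many times larger the OpenInfer p50 is than the DeepEP sum."""
    return openinfer_p50_us / deepep_sum_us


# Hypothetical profiler averages for 8 ranks (not measured data).
dispatch = [10.0, 10.5, 11.0, 10.2, 10.8, 10.1, 10.4, 10.6]
combine = [13.0, 13.2, 12.9, 13.5, 13.1, 13.3, 13.0, 13.4]
worst, avg = summarize(dispatch, combine)
print(f"worst-rank sum: {worst:.3f} us, mean-rank sum: {avg:.3f} us")

# Ratio for the dsv4/tok=1 row of the results table.
print(f"dsv4/tok=1 ratio: {directional_ratio(87.5, 23.815):.1f}x")
```

Because the two sides use different timing harnesses, a ratio computed this way stays directional, as the note says.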
## Results
-| Config | PegaInfer paired p50 us | PegaInfer paired mean us | DeepEP V2 worst-rank sum us | DeepEP V2 mean-rank sum us | Directional ratio vs PegaInfer p50 |
+| Config | OpenInfer paired p50 us | OpenInfer paired mean us | DeepEP V2 worst-rank sum us | DeepEP V2 mean-rank sum us | Directional ratio vs OpenInfer p50 |
| --- | ---: | ---: | ---: | ---: | ---: |
| dsv4/tok=1 | 87.5 | 91.0 | 23.815 | 23.632 | 3.7x |
| dsv4/tok=4 | 95.9 | 97.4 | 24.094 | 23.801 | 4.0x |
@@ -87,9 +87,9 @@ PegaInfer reports event-timed `max_rank_split_sum_us` for the full dispatch_send
| kimi-k2/tok=128 | 32.042 | 42.672 | 74.572 |
| kimi-k2/tok=256 | 50.735 | 71.921 | 122.617 |
-## PegaInfer Baseline Drift
+## OpenInfer Baseline Drift
-The table above uses the PegaInfer run taken in the same benchmarking session as the DeepEP run. The earlier PPLX benchmark snapshot in `docs/benchmarks/pplx-ep-a2a-h20-nvlink.md` was captured at `ec514ef`. Those two PegaInfer snapshots differ enough that the comparison should not pretend to be a precise speedup gate.
+The table above uses the OpenInfer run taken in the same benchmarking session as the DeepEP run. The earlier PPLX benchmark snapshot in `docs/benchmarks/pplx-ep-a2a-h20-nvlink.md` was captured at `ec514ef`. Those two OpenInfer snapshots differ enough that the comparison should not pretend to be a precise speedup gate.
Positive delta means the paired run here is slower than the historical snapshot.
@@ -113,13 +113,13 @@ Using the historical PPLX p50s instead of the paired run gives a directional rat
## Interpretation Guardrails
- DeepEP V2 was measured through the elastic EP path: ElasticBuffer with the NCCL Gin backend. The repository still builds legacy NVSHMEM pieces, but this V2 path is the one relevant to the current comparison.
-- The measured DeepEP V2 case uses FP8 dispatch and BF16 combine. PegaInfer PPLX currently benchmarks a BF16 payload. Treat the table as a backend signal, not an exact dtype-to-dtype gate.
-- DeepEP correctness checks were skipped in this latency run. A replacement decision needs a correctness run in the integrated PegaInfer path.
-- DeepEP `num_tokens` is a max-per-rank input; the test uses slightly different actual token counts across ranks. PegaInfer uses the fixed max token count per rank.
-- DeepEP numbers are profiler kernel averages. PegaInfer numbers are CUDA event timings around the benchmark cycle. The delta is large enough to be actionable, but integration work should add one apples-to-apples harness before replacing backend policy.
+- The measured DeepEP V2 case uses FP8 dispatch and BF16 combine. OpenInfer PPLX currently benchmarks a BF16 payload. Treat the table as a backend signal, not an exact dtype-to-dtype gate.
+- DeepEP correctness checks were skipped in this latency run. A replacement decision needs a correctness run in the integrated OpenInfer path.
+- DeepEP `num_tokens` is a max-per-rank input; the test uses slightly different actual token counts across ranks. OpenInfer uses the fixed max token count per rank.
+- DeepEP numbers are profiler kernel averages. OpenInfer numbers are CUDA event timings around the benchmark cycle. The delta is large enough to be actionable, but integration work should add one apples-to-apples harness before replacing backend policy.
## Read
-DeepEP V2 is especially strong at low token counts: the tested DSV4 and Kimi-K2 shapes sit around 24-34 us for tok <= 32, while the paired PegaInfer PPLX path is roughly 96-147 us. At larger payloads, DeepEP still holds about a 2.5x to 3.1x directional advantage in the paired run.
+DeepEP V2 is especially strong at low token counts: the tested DSV4 and Kimi-K2 shapes sit around 24-34 us for tok <= 32, while the paired OpenInfer PPLX path is roughly 96-147 us. At larger payloads, DeepEP still holds about a 2.5x to 3.1x directional advantage in the paired run.
-The next useful gate is a strict integration benchmark with the same payload dtype, token distribution, correctness checks, and PegaInfer scheduler-facing API cost included.
+The next useful gate is a strict integration benchmark with the same payload dtype, token distribution, correctness checks, and OpenInfer scheduler-facing API cost included.
diff --git a/docs/benchmarks/pplx-ep-a2a-h20-nvlink.md b/docs/benchmarks/pplx-ep-a2a-h20-nvlink.md
index 7ea51956..b3c39f01 100644
--- a/docs/benchmarks/pplx-ep-a2a-h20-nvlink.md
+++ b/docs/benchmarks/pplx-ep-a2a-h20-nvlink.md
@@ -17,7 +17,7 @@ Last touched: 2026-05
## Benchmark
-Binary: `pplx_a2a_bench --sweep` (in `pegainfer-comm`).
+Binary: `pplx_a2a_bench --sweep` (in `openinfer-comm`).
Each config bootstraps a fresh pplx-garden EP backend (CUMem + fabric MR + NVLink peer-map), runs 20 warmup + 100 measured iterations of the full dispatch_send → dispatch_recv → combine_send → combine_recv cycle, and reports `max_rank_split_sum_us` — the per-iteration maximum across all 8 ranks of the four-stage sum.
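The `max_rank_split_sum_us` reduction described above can be illustrated with a small sketch. This is not the benchmark's implementation: the data layout (one dict of stage timings per rank, per iteration) and the helper names are assumptions, and the sample numbers are invented with 2 ranks instead of 8 to keep the example short.

```python
from statistics import median

STAGES = ("dispatch_send", "dispatch_recv", "combine_send", "combine_recv")


def max_rank_split_sum_us(iteration):
    """For one iteration, sum the four stages per rank and take the max rank.

    `iteration` is a list with one dict per rank, mapping stage -> microseconds.
    """
    return max(sum(rank[stage] for stage in STAGES) for rank in iteration)


def p50_us(iterations):
    """Median of the per-iteration worst-rank sums."""
    return median(max_rank_split_sum_us(it) for it in iterations)


def rank(ds, dr, cs, cr):
    """Build one rank's stage timings (hypothetical layout)."""
    return dict(zip(STAGES, (ds, dr, cs, cr)))


# Invented timings: 3 measured iterations, 2 ranks each.
iterations = [
    [rank(10, 20, 30, 40), rank(11, 21, 31, 41)],  # sums 100, 104 -> 104
    [rank(12, 22, 32, 42), rank(10, 20, 30, 40)],  # sums 108, 100 -> 108
    [rank(9, 19, 29, 39), rank(10, 20, 30, 30)],   # sums 96, 90 -> 96
]

print([max_rank_split_sum_us(it) for it in iterations])
print(p50_us(iterations))
```

Taking the maximum across ranks before aggregating over iterations means a single slow rank dominates each sample, which matches the intent of measuring the full exchange cycle rather than an average rank.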
diff --git a/docs/conventions/coding-style.md b/docs/conventions/coding-style.md
index d70b0c65..0eff624b 100644
--- a/docs/conventions/coding-style.md
+++ b/docs/conventions/coding-style.md
@@ -6,4 +6,4 @@ Don't test for the sake of testing. Prefer integration tests over unit tests —
## Logging
-Log through `pegainfer-core::logging`. The text layout already prints each record's module target, so don't prefix messages with a module or model name — no `kimi-k2:`, no `Qwen3.5 `. Error messages in `anyhow!` / `bail!` keep their prefix; they surface to callers without a target.
+Log through `openinfer-core::logging`. The text layout already prints each record's module target, so don't prefix messages with a module or model name — no `kimi-k2:`, no `Qwen3.5 `. Error messages in `anyhow!` / `bail!` keep their prefix; they surface to callers without a target.
diff --git a/docs/index.md b/docs/index.md
index 9f1a4a33..b833cad6 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,10 +25,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| Path | TL;DR |
| --- | --- |
| `models/qwen3/roadmap.md` | Qwen3-4B roadmap (2026-06 review): line is the maturity bar; #220 RoPE OOB now fixed (sized cache + admission guard + kernel trap, gated by reject + in-window ITs); open set is per-row batch sampling, zero TP coverage, zero-adapter-only LoRA gate, dropped prefix-cache observability, stale docs, YaRN #8 follow-up. Sequenced Now/Next/Later + cleanup ledger. |
-| `models/qwen3/model-crate.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; root sees generic `EngineHandle`; split-K retuned to `256/64`, with 4k/64 serving TPOT p50 at `6.46ms` on RTX 5090. |
+| `models/qwen3/model-crate.md` | `openinfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; root sees generic `EngineHandle`; split-K retuned to `256/64`, with 4k/64 serving TPOT p50 at `6.46ms` on RTX 5090. |
| `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. |
| `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. |
-| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `pegainfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
+| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `openinfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
| `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. |
| `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. |
@@ -40,8 +40,8 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `models/qwen35/kv-admission.md` | Issue #254 complete: Qwen3.5 now uses full-lifetime KV admission, deferred pressure handling, impossible-request rejection, explicit error semantics, direct rejection-event coverage, RTX 5090 e2e, and real HTTP pressure/post-pressure validation. |
| `models/qwen35/optimization.md` | Hybrid 24 linear + 8 full attn. At parity with vLLM: TTFT 234ms (+2%), TPOT 11.77ms (+1%). Post-accuracy-fix GDR decode kernel restore (#9). |
| `models/qwen35/accuracy.md` | Qwen3.5-4B HF bf16 logits goldens through `past_key_values`: short replay covers sequential graph, bucket-straddling batched graph, and slot-compaction; long replay covers 4097/8192-token prompts; full GSM8K 8-shot now matches the HF baseline within 0.15 percentage points. |
-| `models/qwen35/model-crate.md` | `pegainfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. |
-| `models/qwen35/kernel-plan.md` | Qwen3.5-4B has a `pegainfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module — enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. |
+| `models/qwen35/model-crate.md` | `openinfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. |
+| `models/qwen35/kernel-plan.md` | Qwen3.5-4B has an `openinfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module — enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. |
## models / deepseek-v4
@@ -56,7 +56,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `models/deepseek-v4/moe-ag-rs.md` | Decode MoE now uses GPU AG/RS, GPU route compaction, and grouped TileLang FP4 local experts; no route/expert D2H in hot path. Current 1x32 TPOT avg `105.54ms`, exact E2E `20/20`. |
| `models/deepseek-v4/moe-tilelang-review.md` | Persistent rank workers + decode-only direct top-k MoE cut 1x32 steady TPOT to `80.49ms/token`; remaining cost is rank arrival skew before `107` f32 collectives/token. |
| `models/deepseek-v4/pplx-ep-integration.md` | DeepSeek V4 PPLX EP integration: pplx-garden decode MoE path, EP8 bootstrap, common NUMA rank-slice placement, and H200 steady TPOT p50 `66.65ms`. |
-| `models/deepseek-v4/kernel-paths.md` | DeepSeek V4 CUDA sources, TileLang generator path, and `pegainfer-kernels/KERNELS.md` routing index are organized. |
+| `models/deepseek-v4/kernel-paths.md` | DeepSeek V4 CUDA sources, TileLang generator path, and `openinfer-kernels/KERNELS.md` routing index are organized. |
## models / deepseek-v2-lite
@@ -74,24 +74,24 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| --- | --- |
| `models/kimi-k2/roadmap.md` | Cross-cutting Kimi-K2 plan, re-verified 2026-06-08 on 8×H200. Decode leads vLLM on the active TP1/DP8 **DeepEP** line (bs64 graph TPOT `26.3 ms` p50 / `30.5` p99); M1 serving contract (sampling/EOS/admission) + M2 accuracy gate shipped and green teacher-forced. Live frontier = serving perf: the "+51% HTTP" (#225) was a **bench/metric artifact** (measured: identical prompts under-measure decode ~7–15% via the Marlin expert GEMM; transport ≈0) — floor ~34 ms, a2a ~30% GPU (#228); TTFT 4.5×/31× behind vLLM (#224). Open correctness debt: tests (#222), concurrent mispick (#286), graph-replay gate (#300). |
| `models/kimi-k2/accuracy-gate.md` | vLLM-golden accuracy gate (#223): `tests/vllm_golden_gate.rs` + committed K2.6 fixture, teacher-forced regret sweep + free-greedy decode parity, run through the real serving path (TP1/DP8/EP8 PPLX); two-tier regret rule (0.30 at confident positions / 1.25 at flat-distribution positions, capped at 2 per pass); a missing model or fixture fails explicitly. |
-| `models/kimi-k2/deepep-migration.md` | PPLX→DeepEP migration implemented: PPLX is fully removed from the kimi path (moe_pplx.rs is gone; the kimi crate no longer depends on pegainfer-comm); decode with `expand=true`+`cpu_sync=false` does zero host sync/allocation (graph-ready; #227 capture still off); Marlin consumes the recv buffer in place (alignment 8 == block size, identity routing + sentinel); router scale is applied at the residual, and combine rounds to bf16 one step earlier. Pending: 8×H200 numerical gate + serving bench. |
+| `models/kimi-k2/deepep-migration.md` | PPLX→DeepEP migration implemented: PPLX is fully removed from the kimi path (moe_pplx.rs is gone; the kimi crate no longer depends on openinfer-comm); decode with `expand=true`+`cpu_sync=false` does zero host sync/allocation (graph-ready; #227 capture still off); Marlin consumes the recv buffer in place (alignment 8 == block size, identity routing + sentinel); router scale is applied at the residual, and combine rounds to bf16 one step earlier. Pending: 8×H200 numerical gate + serving bench. |
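| `models/kimi-k2/sampling.md` | Sampling param surface + design (#237): on TP1/DP8, temperature/top_k/top_p take effect through a single batched FlashInfer pass (greedy rows keep in-graph argmax, zero overhead); TP8 explicitly rejects non-greedy requests; the OpenAI parameter table marks each item honored/rejected/ignored, with no silent paths; e2e and no TPOT regression verified on 8×H200. |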
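| `models/kimi-k2/kv-cache-design.md` | KV cache wired into the qwen3 paged stack (#239→#230/#231), landed in a single PR: the kimi kernel layer was already paged, so kernels are unchanged; a per-rank kvbm `BlockPool` replaces the static slot→pages mapping, with full-lifetime reservation admission + explicit Rejected for out-of-bounds requests; per-request cap 2048→8192 (DP prompt still ≤2048, constrained by the PPLX fabric buffer); substrate for #230/#231; 8×H200 validation pending. |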
| `models/kimi-k2/optimization.md` | Kimi-K2 model card + main decode optimization line. The active mainline is TP1+DP8+EP8 PPLX (decode batch cap 64, buckets `[1,2,4,8,16,32,64]`, bs64 output `1336 tok/s`); the TP8+EP8 NCCL bs4 graph TPOT `14.39ms` path in the second half is a historical bring-up record, kept to explain the MLA/MoE/collective kernel structure. |
| `models/kimi-k2/bringup-history.md` | Condensed history of the Kimi-K2 text-only bring-up (merged from the old support-analysis/changelog/operator-todo trio): HF probe → text manifest → TP8/EP8 sliced loader → MLA + Marlin WNA16 routed expert → NCCL bridge → bs4 wave decode → full CUDA Graph → vLLM top-20 gate. Holds the still-load-bearing checkpoint/INT4/Marlin layout facts and the #234 tombstone (expert-major CUTLASS removed, weight_shape no longer loaded, bs4 cap → 64). |
| `models/kimi-k2/vllm-path-comparison.md` | Kimi-K2 decode path comparison: vLLM-style fused qkv_a, MoE shared/routed compute overlap, shared/dense gate-up fusion, routed scaled-add, and bridge microbench have passed the H20 gate; output64 avg/p50/p99 are all within `15ms`; both the BF16 and F32 variants of vLLM's TP-only MoE final all-reduce are slower than the current RS bridge. |
-| `models/kimi-k2/vllm-h20-baseline.md` | vLLM 0.19.0 H20 ×8 TP1+DP8+EP8 decode-heavy baseline: bs 1..256 sweep; at the bs=8 knee, TPOT med `26.4ms` / aggregate `308 tok/s`; bs=256 reaches `1131 tok/s`; with the same client, pegainfer TP8+EP8 bs=4 TPOT `19.13ms` is 23% lower than vLLM, but the HTTP-measured figure is 33% higher than in-process; frontend overhead still to be investigated. |
+| `models/kimi-k2/vllm-h20-baseline.md` | vLLM 0.19.0 H20 ×8 TP1+DP8+EP8 decode-heavy baseline: bs 1..256 sweep; at the bs=8 knee, TPOT med `26.4ms` / aggregate `308 tok/s`; bs=256 reaches `1131 tok/s`; with the same client, openinfer TP8+EP8 bs=4 TPOT `19.13ms` is 23% lower than vLLM, but the HTTP-measured figure is 33% higher than in-process; frontend overhead still to be investigated. |
| `models/kimi-k2/pplx-ep-decode.md` | PPLX EP decode bs=1 TPOT 37ms → 17.94ms (−52%), beating NCCL no-graph at 18.52ms. Root cause: expert_padding=64 wasted 98% of Marlin compute, plus a serial <<<1,1>>> routing kernel. Includes the full optimization log, failed approaches, and nsys comparison data. |
| `models/kimi-k2/pplx-ep-correctness.md` | TP8/EP8 PPLX correctness baseline: the H20 64-token token trace matches TP8/EP8 NCCL exactly, hash `4920f088c2338236`; records the recv capacity, routed-row top-k weight, and F32 combine boundaries. |
| `models/kimi-k2/tp1-dp8-ep8-performance.md` | TP1 DP8 EP8 performance optimization ledger: O1 prompt_len1 decode admission passes the vLLM bs64 gate; O2 lands 5 decode kernel cherry-picks (cuBLASLt fixed-shape GEMM, argmax split, router fusion), with accuracy held at the bf16 ULP floor by a base-vs-opt prefill logits A/B; the PPLX Marlin small-N tile was identified as the accuracy-breaking point of the original branch because of `-inf`/SIGSEGV and rejected; bs64 TPOT is flat within noise (p50 `40.58→40.09ms`). |
-| `models/kimi-k2/source-layout.md` | Kimi-K2 source files over 1k lines were split by responsibility; the largest Rust file under `pegainfer-kimi-k2/src` is now `layers/attention.rs` at 950 lines. |
+| `models/kimi-k2/source-layout.md` | Kimi-K2 source files over 1k lines were split by responsibility; the largest Rust file under `openinfer-kimi-k2/src` is now `layers/attention.rs` at 950 lines. |
| `models/kimi-k2/dp-design.md` | Configurable TP×DP parallelism: each DP rank is an independent decode engine, EP all-to-all provides a natural sync point, and a lightweight load balancer routes requests. First batch: TP1×DP8 + TP8×DP1. |
## subsystems / runtime
| Path | TL;DR |
| --- | --- |
-| `subsystems/runtime/runtime.md` | Runtime complexity is controlled by a shared `pegainfer-core` that owns the generation contract and orchestration; per-model crates implement `ModelForward` so prefill/decode and hybrid attention stay hidden from the caller. State (`&mut`) is separated from weights (`&self`) for future bs > 1. |
+| `subsystems/runtime/runtime.md` | Runtime complexity is controlled by a shared `openinfer-core` that owns the generation contract and orchestration; per-model crates implement `ModelForward` so prefill/decode and hybrid attention stay hidden from the caller. State (`&mut`) is separated from weights (`&self`) for future bs > 1. |
| `subsystems/runtime/kv-cache-design.md` | Dynamo-style KV cache with logical/physical layering: BlockManager owns block lifecycle and admission; the PhysicalBackend trait owns GPU memory and layout (FullAttention / MLA). Supports TP / DP. Based on a survey of vLLM/Dynamo/pegaflow. |
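| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the KV offload physical backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm does not implement. **Qwen3-4B full-attn ships first; run end to end and verified on a real GPU** (async SAVE+LOAD wired into the executor/scheduler; after restoring from a pure CPU hit or a combined GPU+CPU hit, logits match a cold compute). pegaflow is pinned by git rev (#331+#333). Off by default and not wired to the server CLI. Linear is excluded; sparse is deferred. |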
@@ -106,7 +106,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| Path | TL;DR |
| --- | --- |
| `subsystems/frontend/simulated-inference-engine.md` | CPU-only simulated model crate for vLLM/OpenAI frontend and `vllm bench serve` validation without CUDA, real model weights, or real-model performance claims. |
-| `subsystems/frontend/cpu-profiling-baseline.md` | Frontend CPU profiling baseline using `pegainfer-sim` with fixed TTFT=5ms/TPOT=12ms: 200 req / concurrency=16 shows ~150ms TTFT overhead (no dominant hotspot), heap allocation ~10%, stream polling ~7.5%, IPC ~1%; reproducible benchmark command and perf evidence documented. |
+| `subsystems/frontend/cpu-profiling-baseline.md` | Frontend CPU profiling baseline using `openinfer-sim` with fixed TTFT=5ms/TPOT=12ms: 200 req / concurrency=16 shows ~150ms TTFT overhead (no dominant hotspot), heap allocation ~10%, stream polling ~7.5%, IPC ~1%; reproducible benchmark command and perf evidence documented. |
## subsystems / correctness
@@ -118,16 +118,16 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| Path | TL;DR |
| --- | --- |
-| `subsystems/kernels/pegainfer-kernels-boundary.md` | Architecture decision: pegainfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
+| `subsystems/kernels/openinfer-kernels-boundary.md` | Architecture decision: openinfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
| `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
-| `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `pegainfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
+| `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `openinfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
## playbooks
| Path | TL;DR |
| --- | --- |
| `playbooks/developer-onboarding.md` | New-developer onboarding — toolchain, unified venv, build, tests, quick benchmark validation. |
-| `playbooks/bench-vs-vllm.md` | pegainfer vs vLLM comparative benchmarking: method, workflow, typical configs, gotchas. |
+| `playbooks/bench-vs-vllm.md` | openinfer vs vLLM comparative benchmarking: method, workflow, typical configs, gotchas. |
| `playbooks/model-optimization-pipeline.md` | Per-model optimization methodology: 2 standard profiles, vLLM baseline, e2e dashboard + append-only optimization log. |
| `playbooks/profiling-guide.md` | GPU profiling playbook: nsys pitfalls, diagnostic paths, measured kernel comparisons. |
| `playbooks/accuracy-parity-playbook.md` | Accuracy debugging playbook: truth-source rules, first-diff workflow, bf16 rounding traps, and verified Qwen3.5 parity commands. |
@@ -147,10 +147,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| Path | TL;DR |
| --- | --- |
-| `benchmarks/bs1-4k64-vllm-pegainfer.md` | RTX 5090 single-concurrency probe: `input_len=4096`, `output_len=64`, no vLLM prefix cache. PegaInfer TTFT median `177ms` vs vLLM `198ms`; TPOT median `6.47ms` vs `6.36ms`; corrected output throughput `+6%` for PegaInfer. |
-| `benchmarks/accuracy-eval-results.md` | Phase 1 GSM8K: Qwen3-4B PASS (pegainfer 85.37% vs HF 85.82%, delta -0.45 pp). Qwen3.5-4B historical FAIL recovered by #250 (strict 79.38%, flexible 79.30% vs HF 79.45%). |
+| `benchmarks/bs1-4k64-vllm-openinfer.md` | RTX 5090 single-concurrency probe: `input_len=4096`, `output_len=64`, no vLLM prefix cache. OpenInfer TTFT median `177ms` vs vLLM `198ms`; TPOT median `6.47ms` vs `6.36ms`; corrected output throughput `+6%` for OpenInfer. |
+| `benchmarks/accuracy-eval-results.md` | Phase 1 GSM8K: Qwen3-4B PASS (openinfer 85.37% vs HF 85.82%, delta -0.45 pp). Qwen3.5-4B historical FAIL recovered by #250 (strict 79.38%, flexible 79.30% vs HF 79.45%). |
| `benchmarks/pplx-ep-a2a-h20-nvlink.md` | pplx EP all-to-all latency on 8× H20 NV18 NVLink: DSV4 & Kimi-K2 shapes, tok=1..256. tok=1 p50 ~82μs, tok=256 p50 ~204/303μs. |
-| `benchmarks/deepep-v2-vs-pplx-moe-backend.md` | H20 x8 DeepEP V2 vs current PegaInfer PPLX EP backend comparison: ElasticBuffer/NCCL Gin shows a directional 2.5x-5.3x paired-run ratio on tested DSV4 and Kimi-K2 MoE exchange shapes, with dtype, correctness, harness, and PPLX baseline-drift caveats recorded. |
+| `benchmarks/deepep-v2-vs-pplx-moe-backend.md` | H20 x8 DeepEP V2 vs current OpenInfer PPLX EP backend comparison: ElasticBuffer/NCCL Gin shows a directional 2.5x-5.3x paired-run ratio on tested DSV4 and Kimi-K2 MoE exchange shapes, with dtype, correctness, harness, and PPLX baseline-drift caveats recorded. |
## conventions
diff --git a/docs/lessons/exact-match-gate-thread-cublas.md b/docs/lessons/exact-match-gate-thread-cublas.md
index 9a93216b..9181c5c4 100644
--- a/docs/lessons/exact-match-gate-thread-cublas.md
+++ b/docs/lessons/exact-match-gate-thread-cublas.md
@@ -7,7 +7,7 @@
## Scope
-This note records a cross-cutting runtime/correctness lesson, not a Qwen3.5-only story. It was lifted from the original Qwen3.5 debugging debrief because the concrete fix shipped, but the transferable lessons still matter. The triggering bug was fixed in `pegainfer-qwen35-4b`, but the takeaways apply to any model crate that moves a model onto a worker thread or guards greedy decode with an exact-text gate.
+This note records a cross-cutting runtime/correctness lesson, not a Qwen3.5-only story. It was lifted from the original Qwen3.5 debugging debrief because the concrete fix shipped, but the transferable lessons still matter. The triggering bug was fixed in `openinfer-qwen35-4b`, but the takeaways apply to any model crate that moves a model onto a worker thread or guards greedy decode with an exact-text gate.
## Background
@@ -18,7 +18,7 @@ The regression first appeared at `6a5b826` after cuBLAS handles became thread-lo
- **Read**:
- `docs/index.md` - Qwen3.5 accuracy and optimization docs are the relevant references.
- `docs/models/qwen35/model-crate.md` - confirmed the model-crate split reproduced the same Qwen3.5 e2e failure on old HEAD.
- - `git log -- pegainfer-qwen35-4b pegainfer-kernels ...` - identified Qwen3.5 and sampling-related commits since the last accuracy work (the historical bisect ran against the pre-split `src/model/qwen35` layout).
+ - `git log -- openinfer-qwen35-4b openinfer-kernels ...` - identified Qwen3.5 and sampling-related commits since the last accuracy work (the historical bisect ran against the pre-split `src/model/qwen35` layout).
- **Relevant history**:
- `docs/models/qwen35/model-crate.md` - old HEAD and the model-crate split both fail all 10 Qwen3.5 e2e cases with similar gibberish.
- Commit history has a suspicious sampling change: `020970b refactor(sampling): switch greedy decode to flashinfer top1 (#73)`.
@@ -27,7 +27,7 @@ The regression first appeared at `6a5b826` after cuBLAS handles became thread-lo
### Step 1: Reproduce and bisect through history
- Created a temporary worktree so the active model-crate diff stayed untouched.
-- Older commits needed the current local FlashInfer third-party tree copied into `third_party/flashinfer` and `PEGAINFER_TRITON_PYTHON` pointed at a Python with Triton.
+- Older commits needed the current local FlashInfer third-party tree copied into `third_party/flashinfer` and `OPENINFER_TRITON_PYTHON` pointed at a Python with Triton.
- Results:
- `24be186 refactor(embedding): keep token ids unsigned end-to-end (#71)` passed Qwen3.5 e2e.
- `020970b refactor(sampling): switch greedy decode to flashinfer top1 (#73)` failed a few cases with normal text, matching baseline drift rather than gibberish.
@@ -41,7 +41,7 @@ The regression first appeared at `6a5b826` after cuBLAS handles became thread-lo
- That showed logits/sampling were already wrong at the first sampled token after prefill; decode KV accumulation was not the primary cause.
### Step 3: Fix scheduler thread binding
-- Updated `pegainfer-qwen35-4b/src/scheduler.rs` so the scheduler thread:
+- Updated `openinfer-qwen35-4b/src/scheduler.rs` so the scheduler thread:
- calls `cuda_set_device` for the model device,
- binds the existing `CudaContext` to the scheduler thread,
- initializes thread-local cuBLAS handles on that thread,
@@ -59,14 +59,14 @@ The regression first appeared at `6a5b826` after cuBLAS handles became thread-lo
- That exact-text e2e and `test_data/Qwen3.5-4B.json` are historical now. The current accuracy gate is the HF logits gate; `e2e_scheduler` remains a scheduler integration test for request-flow behavior rather than an exact-text replacement.
### Step 5: Validation
-- Passed (set `PEGAINFER_CUDA_SM` only when overriding SM auto-detection):
+- Passed (set `OPENINFER_CUDA_SM` only when overriding SM auto-detection):
- `cargo fmt --all --check`
- `cargo check --release --workspace --all-targets`
- `cargo clippy --release --workspace --all-targets -- -D warnings`
- Two-run same-seed regen comparison with a temporary model alias while evaluating FlashInfer top1 behavior.
- - `cargo test --release -p pegainfer test_gpu_sample -- --nocapture`
- - `PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen35-4b --test e2e -- --nocapture`
- - `PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen35-4b --test e2e_scheduler -- --nocapture`
+ - `cargo test --release -p openinfer test_gpu_sample -- --nocapture`
+ - `OPENINFER_TEST_MODEL_PATH= cargo test --release -p openinfer-qwen35-4b --test e2e -- --nocapture`
+ - `OPENINFER_TEST_MODEL_PATH= cargo test --release -p openinfer-qwen35-4b --test e2e_scheduler -- --nocapture`
- `git diff --check`
## Debrief
diff --git a/docs/lessons/moe-zero-prefill-long-prefill.md b/docs/lessons/moe-zero-prefill-long-prefill.md
index 7beddb3f..73d840e2 100644
--- a/docs/lessons/moe-zero-prefill-long-prefill.md
+++ b/docs/lessons/moe-zero-prefill-long-prefill.md
@@ -3,16 +3,16 @@
> **TL;DR:** ZeRO-Prefill gives us a boundary for a future long-prefill cluster, not a router design or an implementation commitment.
>
> - **Want:** a long-P engine path that maximizes batch throughput once an external router has already selected long-prefill work.
-> - **Avoid:** putting long/delta classification, batch admission policy, or router state inside pegainfer.
+> - **Avoid:** putting long/delta classification, batch admission policy, or router state inside openinfer.
> - **Why:** long prefill can provide enough compute to hide expert-weight movement, while decode and delta-prefill have different latency and state constraints.
>
> **Status:** Discussion record. No implementation, measurement threshold, or connector API is committed here.
## Scope
-This note records what we learned from "ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving" ([arXiv:2605.02960](https://arxiv.org/abs/2605.02960)) and how it should shape future PegaFlow/PegaInfer planning for large MoE serving.
+This note records what we learned from "ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving" ([arXiv:2605.02960](https://arxiv.org/abs/2605.02960)) and how it should shape future PegaFlow/OpenInfer planning for large MoE serving.
-The assumed product shape is P/D separation with an external router. The router is responsible for deciding whether work belongs to long prefill, delta prefill, or decode. This document only describes what pegainfer should care about after the router has already handed it a long-prefill batch.
+The assumed product shape is P/D separation with an external router. The router is responsible for deciding whether work belongs to long prefill, delta prefill, or decode. This document only describes what openinfer should care about after the router has already handed it a long-prefill batch.
The goal is a reusable boundary record, not an implementation plan. Exact backend design, telemetry fields, measurement thresholds, and connector protocols are outside this document.
@@ -49,7 +49,7 @@ ZeRO-Prefill includes a KV-cache-free mode for prefill-only workloads that direc
The paper's waste sources matter most when a long-prefill batch has enough work to make compute the dominant resource. In our P/D-separated roadmap, short delta-prefill and decode should not be assumed to satisfy the same condition.
-For pegainfer, the first long-P goal is to keep selected long-prefill work compute-bound. Once the router has already selected a long-prefill batch, the engine should avoid fragmenting it into chunks that lose MFU or make expert transfer visible again.
+For openinfer, the first long-P goal is to keep selected long-prefill work compute-bound. Once the router has already selected a long-prefill batch, the engine should avoid fragmenting it into chunks that lose MFU or make expert transfer visible again.
**Want:** execution that preserves enough per-GPU prefill work to make long-P throughput the main objective.
@@ -105,7 +105,7 @@ The future measurement spec should explain at least:
## Derivation: Future Reuse
-When PegaFlow/PegaInfer planning revisits long-prefill MoE serving, use this note as the entry point:
+When PegaFlow/OpenInfer planning revisits long-prefill MoE serving, use this note as the entry point:
1. Assume the router has already selected a long-prefill batch.
2. Evaluate whether the engine keeps that batch compute-bound.
diff --git a/docs/models/deepseek-v2-lite/decode-attribution-gate.md b/docs/models/deepseek-v2-lite/decode-attribution-gate.md
index a17c1bb0..5f3d207a 100644
--- a/docs/models/deepseek-v2-lite/decode-attribution-gate.md
+++ b/docs/models/deepseek-v2-lite/decode-attribution-gate.md
@@ -10,16 +10,16 @@ This gate deliberately stays model-specific and shape-specific:
- Model: DeepSeek-V2-Lite.
- Shape: batch size `1`, `4`, or `8`, prompt `Hello`, prompt token ids `[17464]`, output length `16`.
-- Backends: default host-staged EP2 and `PEGAINFER_DSV2_LITE_EP_BACKEND=nccl`.
+- Backends: default host-staged EP2 and `OPENINFER_DSV2_LITE_EP_BACKEND=nccl`.
- Accuracy oracle: the same generated token/text/hash gate used by `hf-accuracy-gate.md`.
- Attribution source: `DeepSeekV2LiteEp2Generator::generate_greedy_with_attribution` for `batch-size=1`, and `DeepSeekV2LiteEp2Generator::generate_greedy_batch_same_prompt_with_attribution` for `batch-size>1`.
- GPU attribution source: CUDA events around selected stream sections in the explicit attribution path.
-- NVTX source: set `PEGAINFER_DSV2_LITE_NVTX=1` to emit matching ranges for those selected sections during a profiler run.
+- NVTX source: set `OPENINFER_DSV2_LITE_NVTX=1` to emit matching ranges for those selected sections during a profiler run.
Out of scope:
- sparse dispatch;
-- pegainfer-comm / NVLink backend;
+- openinfer-comm / NVLink backend;
- multi-node or generic EP topology;
- production continuous batching or broader prompts;
- performance improvement or throughput claims.
@@ -54,14 +54,14 @@ python tools/accuracy/hf_dump_dsv2_lite_ep2_greedy.py \
--output-len 16 \
--out target/accuracy/dsv2-lite-ep2/hf.json
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
- cargo test --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
+ cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl.json \
- cargo test --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl.json \
+ cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--hf target/accuracy/dsv2-lite-ep2/hf.json \
@@ -71,18 +71,18 @@ python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--require-all-exact
```
-Then collect attribution for the same two pegainfer backends. Use `--batch-size 1` for the original single-row gate, and `--batch-size 4` / `--batch-size 8` for the true-batch benchmark attribution shape:
+Then collect attribution for the same two openinfer backends. Use `--batch-size 1` for the original single-row gate, and `--batch-size 4` / `--batch-size 8` for the true-batch benchmark attribution shape:
```bash
-cargo run --release -p pegainfer-deepseek-v2-lite \
+cargo run --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
--batch-size 1 \
--out target/accuracy/dsv2-lite-ep2/host-staged-attribution.json
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
- cargo run --release -p pegainfer-deepseek-v2-lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+ cargo run --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
@@ -90,15 +90,15 @@ PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
--out target/accuracy/dsv2-lite-ep2/nccl-attribution.json
for batch in 4 8; do
- cargo run --release -p pegainfer-deepseek-v2-lite \
+ cargo run --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
--batch-size "$batch" \
--out "target/accuracy/dsv2-lite-ep2/host-staged-batch${batch}-attribution.json"
- PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
- cargo run --release -p pegainfer-deepseek-v2-lite \
+ OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+ cargo run --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
@@ -107,13 +107,13 @@ for batch in 4 8; do
done
```
-For an Nsight Systems pass, run the same attribution command under the profiler and set `PEGAINFER_DSV2_LITE_NVTX=1`; the JSON `coverage` row then records `nvtx_ranges=emitted`. The NVTX labels are correlation markers for the selected GPU/NCCL sections, not timing evidence by themselves. Their wall-clock span can include CPU-side wrapper work, event setup, and synchronization around the section, so compare JSON `by_gpu_*` rows only with CUDA event timing, not with raw NVTX range duration.
+For an Nsight Systems pass, run the same attribution command under the profiler and set `OPENINFER_DSV2_LITE_NVTX=1`; the JSON `coverage` row then records `nvtx_ranges=emitted`. The NVTX labels are correlation markers for the selected GPU/NCCL sections, not timing evidence by themselves. Their wall-clock span can include CPU-side wrapper work, event setup, and synchronization around the section, so compare JSON `by_gpu_*` rows only with CUDA event timing, not with raw NVTX range duration.
To inspect the CUDA Graph readiness boundary for the current NCCL backend, run the attribution binary with the optional smoke flag:
```bash
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
- cargo run --release -p pegainfer-deepseek-v2-lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+ cargo run --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
diff --git a/docs/models/deepseek-v2-lite/device-resident-nccl-combine.md b/docs/models/deepseek-v2-lite/device-resident-nccl-combine.md
index be7991a0..7ac35f70 100644
--- a/docs/models/deepseek-v2-lite/device-resident-nccl-combine.md
+++ b/docs/models/deepseek-v2-lite/device-resident-nccl-combine.md
@@ -12,16 +12,16 @@ Last touched: 2026-06
- `docs/models/deepseek-v2-lite/decode-attribution-gate.md` - acceptance uses the `Hello` / 16-token HF / host-staged / NCCL gate plus graph-readiness blockers.
- `docs/models/deepseek-v2-lite/hf-accuracy-gate.md` - same-host HF, host-staged, and NCCL token/text exactness is the correctness standard.
- `docs/models/deepseek-v2-lite/source-layout.md` - runtime responsibilities are split, and issue #275 was intentionally left as follow-up work.
- - `pegainfer-deepseek-v2-lite/src/runtime/moe.rs` - the pre-#275 NCCL combine path accumulated routed expert outputs in host `Vec` buffers, then copied H2D for NCCL and D2H before final H2D conversion.
- - `pegainfer-deepseek-v2-lite/src/nccl_backend.rs` - the pre-#275 NCCL combine path allocated send/recv device buffers inside each call and synchronized both streams.
- - `pegainfer-deepseek-v2-lite/src/runtime/readiness.rs` - the pre-#275 readiness report listed combine H2D, allocation, sync, and D2H blockers.
- - `pegainfer-kernels/src/ops/elementwise.rs` and `pegainfer-kernels/csrc/shared/elementwise.cu` - existing device f32/bf16 conversion helpers could be reused, but there was no f32 accumulation helper for bf16 expert output.
+ - `openinfer-deepseek-v2-lite/src/runtime/moe.rs` - the pre-#275 NCCL combine path accumulated routed expert outputs in host `Vec` buffers, then copied H2D for NCCL and D2H before final H2D conversion.
+ - `openinfer-deepseek-v2-lite/src/nccl_backend.rs` - the pre-#275 NCCL combine path allocated send/recv device buffers inside each call and synchronized both streams.
+ - `openinfer-deepseek-v2-lite/src/runtime/readiness.rs` - the pre-#275 readiness report listed combine H2D, allocation, sync, and D2H blockers.
+ - `openinfer-kernels/src/ops/elementwise.rs` and `openinfer-kernels/csrc/shared/elementwise.cu` - existing device f32/bf16 conversion helpers could be reused, but there was no f32 accumulation helper for bf16 expert output.
- **Relevant history**:
- `docs/models/deepseek-v2-lite/status.md` - NCCL plus CUDA Graph is the preferred direction, but the current gate must not be described as production EP.
- `docs/models/deepseek-v2-lite/source-layout.md` - local macOS checks are not enough for this path; remote 2-GPU validation is the real acceptance path.
- **Implemented**:
1. Add a shared CUDA helper that accumulates a bf16 single-token expert output into a f32 device contribution buffer at a selected token row.
- 2. Re-export that helper through `pegainfer-core::ops`.
+ 2. Re-export that helper through `openinfer-core::ops`.
3. Add reusable NCCL combine scratch buffers inside `NaiveNcclEp2Backend`, clear the f32 send scratch per MoE call, accumulate local/remote expert outputs on the owning device, all-reduce device buffers, and cast rank0 f32 result to bf16 on device.
4. Update graph-readiness blockers and attribution wording so removed combine H2D/D2H/allocation/sync blockers are no longer claimed, while remaining host routing and dense-exchange blockers stay explicit.
5. Run formatting and local compile gates, then use the provided remote GPU host for the DeepSeek-V2-Lite EP2 exactness and attribution gates.
@@ -37,9 +37,9 @@ Validated 2026-06-08 on the provided 2x RTX 5090 host with DeepSeek-V2-Lite snap
Commands run:
```bash
-cargo test --offline --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 --no-run
+cargo test --offline --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 --no-run
-cargo clippy --offline --release -p pegainfer-deepseek-v2-lite \
+cargo clippy --offline --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite --bins --tests -- \
-D warnings \
-A clippy::option_option \
@@ -52,14 +52,14 @@ python tools/accuracy/hf_dump_dsv2_lite_ep2_greedy.py \
--output-len 16 \
--out target/accuracy/dsv2-lite-ep2/hf.json
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
- cargo test --offline --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
+ cargo test --offline --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl-after-decouple-cleanup.json \
- cargo test --offline --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl-after-decouple-cleanup.json \
+ cargo test --offline --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--hf target/accuracy/dsv2-lite-ep2/hf.json \
@@ -68,8 +68,8 @@ python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--out target/accuracy/dsv2-lite-ep2/comparison-after-decouple-cleanup.json \
--require-all-exact
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
- cargo run --offline --release -p pegainfer-deepseek-v2-lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+ cargo run --offline --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite \
--bin dsv2_lite_ep2_decode_attribution \
-- --model-path models/DeepSeek-V2-Lite \
@@ -84,14 +84,14 @@ Results:
- Token SHA256: `4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225`.
- Text SHA256: `0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347`.
- Candidate NCCL attribution: `gpu_timing.sample_count=8384`, `failure_count=0`.
-- Initial remote cleanup gate: package `--bins --tests` clippy passed with only three explicit allows for then-existing lints (`pegainfer-core::logging` `option_option`, and two `host_ops` test lints).
+- Initial remote cleanup gate: package `--bins --tests` clippy passed with only three explicit allows for then-existing lints (`openinfer-core::logging` `option_option`, and two `host_ops` test lints).
Follow-up review gate on 2026-06-09 after fixing those lints:
```bash
cargo fmt --all --check
-cargo clippy --release -p pegainfer-deepseek-v2-lite \
+cargo clippy --release -p openinfer-deepseek-v2-lite \
--features deepseek-v2-lite --bins --tests -- -D warnings
```
diff --git a/docs/models/deepseek-v2-lite/hf-accuracy-gate.md b/docs/models/deepseek-v2-lite/hf-accuracy-gate.md
index e1e17d28..61c3680a 100644
--- a/docs/models/deepseek-v2-lite/hf-accuracy-gate.md
+++ b/docs/models/deepseek-v2-lite/hf-accuracy-gate.md
@@ -2,7 +2,7 @@
> **TL;DR:** HF comparison gate for issue #135 after PR #149 and PR #150. The remaining correctness question was not NCCL performance; it was whether the existing DeepSeek-V2-Lite EP=2 baseline matches Hugging Face `generate(use_cache=true)` greedy decode for `prompt="Hello"`, `batch=1`, `output_len=16`.
>
-> **Status:** Passing for the covered shape. The latest run is token-exact and text-exact across HF `generate(use_cache=true)` greedy, pegainfer host-staged EP2, and pegainfer NCCL EP2.
+> **Status:** Passing for the covered shape. The latest run is token-exact and text-exact across HF `generate(use_cache=true)` greedy, openinfer host-staged EP2, and openinfer NCCL EP2.
## Scope
@@ -10,7 +10,7 @@ In scope:
- HF truth: `AutoTokenizer` and `AutoModelForCausalLM` with `trust_remote_code=True`, `torch_dtype=torch.bfloat16`, `model.eval()`, and `torch.no_grad()`.
- Generation shape: batch `1`, prompt `Hello`, prompt token ids `[17464]`, output length `16`, greedy argmax only.
-- Pegainfer paths: default host-staged EP2 backend and explicit `PEGAINFER_DSV2_LITE_EP_BACKEND=nccl`.
+- Openinfer paths: default host-staged EP2 backend and explicit `OPENINFER_DSV2_LITE_EP_BACKEND=nccl`.
- Result comparison: generated token ids, generated text, token sha256, text sha256, and first different generated-token index.
Out of scope:
@@ -24,15 +24,15 @@ Out of scope:
| Issue / maintainer requirement | Covered by | Evidence |
| --- | --- | --- |
-| DeepSeek-V2-Lite config loads independently from DeepSeek V4 assumptions. | PR #149 | Dedicated `pegainfer-deepseek-v2-lite` config/weight/model crate. |
+| DeepSeek-V2-Lite config loads independently from DeepSeek V4 assumptions. | PR #149 | Dedicated `openinfer-deepseek-v2-lite` config/weight/model crate. |
| Single-node `ep_size=2` validates rank, expert ownership, and local expert count. | PR #149 | EP layout is fixed to rank 0 experts `0..31` and rank 1 experts `32..63`, with load-time validation. |
| Each rank only loads its owned 32 routed experts. | PR #149 | Driver rank loads rank 0 experts; expert rank loads only rank 1 routed experts. |
| Unsupported backend/topology reports explicit errors. | PR #149 / #150 | Unsupported device count, duplicate devices, cuda_graph, and backend names fail closed. |
| Minimal dispatch/combine path exists for the first correctness gate. | PR #149 | Host-staged dispatch/combine path remains the default baseline. |
-| Maintainer-requested naive NCCL backend exists before pegainfer-comm/NVLink work. | PR #150 | `PEGAINFER_DSV2_LITE_EP_BACKEND=nccl` path passes the same EP2 greedy E2E as host-staged. |
+| Maintainer-requested naive NCCL backend exists before openinfer-comm/NVLink work. | PR #150 | `OPENINFER_DSV2_LITE_EP_BACKEND=nccl` path passes the same EP2 greedy E2E as host-staged. |
| HF ground-truth accuracy comparison exists. | This gate | HF `generate(use_cache=true)` greedy, host-staged EP2, and NCCL EP2 are token/text exact for the covered shape. |
-Together with PR #149 and PR #150, this gate covers issue #135's correctness-first acceptance surface for the narrow EP=2 milestone. Follow-up work should be tracked separately for sparse/GPU dispatch, pegainfer-comm/NVLink integration, performance evidence, long context, and broader prompts/batches.
+Together with PR #149 and PR #150, this gate covers issue #135's correctness-first acceptance surface for the narrow EP=2 milestone. Follow-up work should be tracked separately for sparse/GPU dispatch, openinfer-comm/NVLink integration, performance evidence, long context, and broader prompts/batches.
## Commands
@@ -47,14 +47,14 @@ python tools/accuracy/hf_dump_dsv2_lite_ep2_greedy.py \
--output-len 16 \
--out target/accuracy/dsv2-lite-ep2/hf.json
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
- cargo test --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/host-staged.json \
+ cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
-PEGAINFER_DSV2_LITE_EP_BACKEND=nccl \
-PEGAINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl.json \
- cargo test --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V2-Lite \
+OPENINFER_DSV2_LITE_EP_BACKEND=nccl \
+OPENINFER_DSV2_LITE_E2E_JSON_OUT=target/accuracy/dsv2-lite-ep2/nccl.json \
+ cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --test e2e_ep2 -- --nocapture
python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--hf target/accuracy/dsv2-lite-ep2/hf.json \
@@ -71,20 +71,20 @@ On Blackwell-class GPUs, make sure the selected NCCL runtime supports the device
## Interpretation
- `all_token_text_exact`: HF, host-staged, and NCCL agree on generated token ids and generated text.
-- `pegainfer_baseline_accuracy_gap`: host-staged and NCCL match each other, but both differ from HF. Treat this as a pegainfer baseline accuracy problem before touching NCCL transport.
+- `openinfer_baseline_accuracy_gap`: host-staged and NCCL match each other, but both differ from HF. Treat this as an openinfer baseline accuracy problem before touching NCCL transport.
- `nccl_transport_regression`: host-staged and NCCL differ. Debug the NCCL path before drawing any HF parity conclusion.
## Latest Evidence
2026-05-30, single-node 2 GPU validation with the same `models/DeepSeek-V2-Lite` snapshot for all three outputs. The model snapshot metadata recorded commit `604d5664dddd88a0433dbae533b7fe9472482de0`. The HF truth source used `AutoModelForCausalLM.generate(..., do_sample=false, use_cache=true)` with `torch==2.7.0+cu128` and `transformers==4.40.2` on 2x A800-SXM4-80GB:
-The comparison gate must be run with an HF JSON dumped on the same model directory and runtime as the pegainfer outputs. The Rust E2E keeps known HF-confirmed hash pairs for this narrow `Hello`/16 shape because the same snapshot has produced different greedy text on RTX 5090 and A800 while still matching HF on each host. This does not claim a model-runtime improvement, a manual-loop root cause, or a transport issue.
+The comparison gate must be run with an HF JSON dumped on the same model directory and runtime as the openinfer outputs. The Rust E2E keeps known HF-confirmed hash pairs for this narrow `Hello`/16 shape because the same snapshot has produced different greedy text on RTX 5090 and A800 while still matching HF on each host. This does not claim a model-runtime improvement, a manual-loop root cause, or a transport issue.
| Source | Backend | Tokens | Token SHA256 | Text SHA256 | Text |
| --- | --- | ---: | --- | --- | --- |
| HF | `generate(use_cache=true)` | 16 | `d05a7b0f0ac6435fb51040582a337d8b6d72844dd61194daa1b3090fa0e16ce8` | `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6` | `, I am a 20 year old female and I have been having a` |
-| pegainfer | host-staged | 16 | `d05a7b0f0ac6435fb51040582a337d8b6d72844dd61194daa1b3090fa0e16ce8` | `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6` | `, I am a 20 year old female and I have been having a` |
-| pegainfer | NCCL | 16 | `d05a7b0f0ac6435fb51040582a337d8b6d72844dd61194daa1b3090fa0e16ce8` | `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6` | `, I am a 20 year old female and I have been having a` |
+| openinfer | host-staged | 16 | `d05a7b0f0ac6435fb51040582a337d8b6d72844dd61194daa1b3090fa0e16ce8` | `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6` | `, I am a 20 year old female and I have been having a` |
+| openinfer | NCCL | 16 | `d05a7b0f0ac6435fb51040582a337d8b6d72844dd61194daa1b3090fa0e16ce8` | `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6` | `, I am a 20 year old female and I have been having a` |
Known HF-confirmed static E2E pairs for snapshot `604d5664dddd88a0433dbae533b7fe9472482de0`:
diff --git a/docs/models/deepseek-v2-lite/source-layout.md b/docs/models/deepseek-v2-lite/source-layout.md
index 739778a5..65fb2f91 100644
--- a/docs/models/deepseek-v2-lite/source-layout.md
+++ b/docs/models/deepseek-v2-lite/source-layout.md
@@ -22,7 +22,7 @@ boundaries:
## Layout
-`pegainfer-deepseek-v2-lite/src/runtime.rs` is now a facade that keeps the
+`openinfer-deepseek-v2-lite/src/runtime.rs` is now a facade that keeps the
public generator and result exports stable. Implementation moved into:
| File | Responsibility |
@@ -38,7 +38,7 @@ public generator and result exports stable. Implementation moved into:
## What Stayed
-- Public exports from `pegainfer-deepseek-v2-lite/src/lib.rs` still expose
+- Public exports from `openinfer-deepseek-v2-lite/src/lib.rs` still expose
`DeepSeekV2LiteEp2Generator`, `GenerationResult`,
`BatchedGenerationResult`, `GenerationStats`, and
`DecodeGraphReadinessReport`.
@@ -61,12 +61,12 @@ cargo fmt --all --check
Both passed.
Remote validation ran on Ubuntu 22.04 with 2x NVIDIA GeForce RTX 5090, driver
-580.105.08, CUDA 12.8, `PEGAINFER_CUDA_SM=120`, and
-`PEGAINFER_TRITON_PYTHON=/root/autodl-tmp/pegainfer-triton-venv/bin/python`.
+580.105.08, CUDA 12.8, `OPENINFER_CUDA_SM=120`, and
+`OPENINFER_TRITON_PYTHON=/root/autodl-tmp/openinfer-triton-venv/bin/python`.
Passed gates:
-- `cargo check --offline --release -p pegainfer-deepseek-v2-lite --features deepseek-v2-lite --lib --tests`
+- `cargo check --offline --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --lib --tests`
- HF oracle dump with `tools/accuracy/hf_dump_dsv2_lite_ep2_greedy.py`
- host-staged `tests/e2e_ep2.rs`
- NCCL `tests/e2e_ep2.rs` using `LD_LIBRARY_PATH=/root/autodl-tmp/nccl-2.27.7/nvidia/nccl/lib`
diff --git a/docs/models/deepseek-v2-lite/status.md b/docs/models/deepseek-v2-lite/status.md
index 30c04559..2c6b1982 100644
--- a/docs/models/deepseek-v2-lite/status.md
+++ b/docs/models/deepseek-v2-lite/status.md
@@ -28,7 +28,7 @@ The retained correctness gate is deliberately narrow:
- prompt token ids: `[17464]`;
- output length: `16`;
- generation mode: greedy;
-- backends: host-staged and `PEGAINFER_DSV2_LITE_EP_BACKEND=nccl`.
+- backends: host-staged and `OPENINFER_DSV2_LITE_EP_BACKEND=nccl`.
The comparison gate must be run on the same model snapshot for HF, host-staged, and NCCL outputs. Same-host comparison remains strict: HF, host-staged, and NCCL must be token-exact and text-exact. Host-staged remains the baseline oracle for NCCL transport changes.
@@ -62,12 +62,12 @@ PR #196 extends attribution for the same direct diagnostic shapes. The retained
In response to issue #170's request for a vLLM TP2+EP2 or pure TP2 comparison, a manual same-model snapshot was collected with `vllm bench serve` concurrency pressure `1`, `4`, and `8`.
-This table is retained only to document the current gap. It is not evidence of a complete, fair production-serving parity comparison, and `--max-concurrency` should be read as concurrent request pressure, not as proof of true internal PegaInfer batch size.
+This table is retained only to document the current gap. It is not evidence of a complete, fair production-serving parity comparison, and `--max-concurrency` should be read as concurrent request pressure, not as proof of true internal OpenInfer batch size.
| Engine | Mode | conc=1 TPOT ms | conc=4 TPOT ms | conc=8 TPOT ms | Output tok/s at 1/4/8 |
| --- | --- | ---: | ---: | ---: | --- |
-| PegaInfer | host-staged | 49.95 | 51.30 | 51.22 | 19.84 / 19.53 / 19.56 |
-| PegaInfer | NCCL | 178.31 | 173.22 | 174.46 | 5.59 / 5.77 / 5.73 |
+| OpenInfer | host-staged | 49.95 | 51.30 | 51.22 | 19.84 / 19.53 / 19.56 |
+| OpenInfer | NCCL | 178.31 | 173.22 | 174.46 | 5.59 / 5.77 / 5.73 |
| vLLM | TP2 default | 35.61 | 36.43 | 36.37 | 27.54 / 97.72 / 195.28 |
| vLLM | TP2+EP2 default | 34.15 | 34.97 | 34.88 | 28.87 / 101.52 / 204.08 |
@@ -75,7 +75,7 @@ Interpretation:
- at single-concurrency TPOT, host-staged is closer to vLLM than the current NCCL backend;
- NCCL remains a correctness-first backend and is still significantly slower than host-staged;
-- PegaInfer HTTP throughput did not scale with concurrency in this snapshot, so serving batching remains open;
+- OpenInfer HTTP throughput did not scale with concurrency in this snapshot, so serving batching remains open;
- vLLM TP2+EP2 worked in this environment and should stay in future comparison matrices.
## Claim Boundaries
@@ -86,7 +86,7 @@ Use these labels consistently:
| --- | --- | --- |
| `direct single-row` | In-process batch `1` decode. | HTTP serving throughput. |
| `direct same-prompt diagnostic batch` | Fixed same-prompt direct batch sizes `1/4/8`. | Production continuous batching or mixed-request scheduling. |
-| `HTTP concurrency pressure` | `vllm bench serve --max-concurrency N` against an HTTP endpoint. | True PegaInfer batch size unless the engine path proves it. |
+| `HTTP concurrency pressure` | `vllm bench serve --max-concurrency N` against an HTTP endpoint. | True OpenInfer batch size unless the engine path proves it. |
Do not claim:
@@ -113,8 +113,8 @@ The next implementation should be chosen from measured evidence:
- judge issue #170 by whether it reduces NCCL decode overhead and makes the path more graph-friendly.
2. Keep a fair serving benchmark contract around future performance work.
- - PegaInfer host-staged.
- - PegaInfer NCCL.
+ - OpenInfer host-staged.
+ - OpenInfer NCCL.
- vLLM TP2.
- vLLM TP2+EP2 when supported.
- default vLLM configuration plus a controlled configuration with cache/flag choices recorded.
diff --git a/docs/models/deepseek-v4/decode-performance.md b/docs/models/deepseek-v4/decode-performance.md
index ca114d91..160fc91b 100644
--- a/docs/models/deepseek-v4/decode-performance.md
+++ b/docs/models/deepseek-v4/decode-performance.md
@@ -37,7 +37,7 @@ Current evidence:
| Requirement | Evidence | Status |
| --- | --- | --- |
| Main objective: stable sub-`25ms/token` DeepSeek V4 decode without bs=1 or seq_len=1 specialization | Best retained repeats reached the `26.28-27.31ms/token` band; fresh 5-run stability sweep after the latest rejected act_quant probe is `28.29-28.91ms` aggregate steady TPOT while another CPU load was running | Not achieved. Keep the goal active. |
-| Fixed bench stable sub-30 with hash `6346f03343d75a65` | `$RESULT_ROOT/dsv4_stability_after_act_quant_revert_{1..5}.json` records 5 consecutive fixed bench runs, aggregate steady TPOT avg `28.291-28.912ms`, and all 15 per-iteration hashes `6346f03343d75a65`; reviewer rerun `$RESULT_ROOT/pegainfer_dev_pr101_bench_{1..5}.json` observed aggregate steady TPOT avg `27.552965-29.755957ms`, again with all 15 hashes `6346f03343d75a65` | Achieved for the retained tree. |
+| Fixed bench stable sub-30 with hash `6346f03343d75a65` | `$RESULT_ROOT/dsv4_stability_after_act_quant_revert_{1..5}.json` records 5 consecutive fixed bench runs, aggregate steady TPOT avg `28.291-28.912ms`, and all 15 per-iteration hashes `6346f03343d75a65`; reviewer rerun `$RESULT_ROOT/openinfer_dev_pr101_bench_{1..5}.json` observed aggregate steady TPOT avg `27.552965-29.755957ms`, again with all 15 hashes `6346f03343d75a65` | Achieved for the retained tree. |
| Exact E2E remains `20/20` | `$RESULT_ROOT/dsv4_fresh_e2e_after_w2_reduce_doc.log` records `All 20 DeepSeek V4 exact cases passed` | Achieved for the retained tree. |
| Public vLLM/SGLang MoE decomposition is replicated first | Runtime uses routed FP4 `W13 grouped GEMM -> fused SwiGLU + W2 activation quant -> W2 grouped GEMM`; old split W1/W3/SwiGLU/W2 public and FFI paths are removed | Achieved. |
| Deeper W13 accumulator -> SwiGLU -> W2-quant path is explored only after microbench/fuzz | TileLang W13 accumulator prototype was compiled after lowering fixes but failed the first active-expert fuzz shape, so it was removed before runtime integration | Explored and rejected; still open as a future true tensor-core epilogue project. |
@@ -50,14 +50,14 @@ Audit conclusion: the goal is not complete because stable sub-`25ms/token` has n
The goal is to copy the mature decomposition and validation discipline, not the framework surface. This table is the current "homework ledger" for DeepSeek V4 decode MoE:
-| Source idea | vLLM/SGLang anchor | PegaInfer status | Decision |
+| Source idea | vLLM/SGLang anchor | OpenInfer status | Decision |
| --- | --- | --- | --- |
| Experts core decomposition: `W13 grouped GEMM -> activation/quant -> W2 grouped GEMM` | vLLM `docs/design/fused_moe_modular_kernel.md`; vLLM `fused_moe/modular_kernel.py`; SGLang `srt/layers/moe/moe_runner/triton.py` | Retained as routed FP4 W13 grouped launch plus fused SwiGLU+W2 activation quant plus W2 grouped FP4 launch | Adopted. This is the baseline decomposition and the old split W1/W3/SwiGLU/W2 public path is removed. |
-| Prepare/finalize can be separate from experts | vLLM `FusedMoEPrepareAndFinalizeModular`; SGLang EP MoE dispatcher/finalizer split | Our AG/RS, route mapping, local experts, partial combine, and reduce-scatter are explicit stages | Adopted selectively. We keep the simpler PegaInfer scheduler/worker structure rather than importing generic dispatch classes. |
+| Prepare/finalize can be separate from experts | vLLM `FusedMoEPrepareAndFinalizeModular`; SGLang EP MoE dispatcher/finalizer split | Our AG/RS, route mapping, local experts, partial combine, and reduce-scatter are explicit stages | Adopted selectively. We keep the simpler OpenInfer scheduler/worker structure rather than importing generic dispatch classes. |
| Async prepare/finalize enables shared expert overlap | vLLM `prepare_no_receive`; vLLM modular-kernel doc notes shared expert overlap during communication | MoE hidden/token all-gather and reduce-scatter run on a MoE NCCL stream while shared expert runs on the main compute stream | Adopted. Full shared-compute-stream overlap changed token hash and was rejected. |
| `TopKWeightAndReduce` may live inside experts | vLLM `topk_weight_and_reduce.py`; vLLM `FusedMoEExpertsModular::finalize_weight_and_reduce_impl` | Atomic epilogue-shaped microbench was slower than current deterministic reduce | Rejected for current layout. This needs a token-major or deterministic W2 scheduler, not atomics bolted onto expert-major TileLang W2. |
| W13 layout should match fused SwiGLU convention | vLLM `oracle/mxfp4.py`; vLLM `quantization/utils/flashinfer_utils.py`; SGLang `moe_runner/flashinfer_cutedsl.py` | Pair-interleaved `[up, gate]` standalone SwiGLU+quant was byte-identical but mostly flat and tiny | Rejected as standalone. Keep the note for a true W13 epilogue fusion. |
-| FlashInfer/TRTLLM/CuteDSL FP4 MoE backends use specialized weight/scale reorder | vLLM `experts/trtllm_mxfp4_moe.py`; vLLM `oracle/mxfp4.py`; SGLang `quantization/mxfp4.py`; SGLang `moe_runner/flashinfer_trtllm.py` | Not integrated. Current PegaInfer weights are per-expert tensors and TileLang grouped GEMM takes pointer arrays; FlashInfer routes expect different packed/reordered layouts and runner-level metadata | Candidate, but only after a standalone grouped-GEMM microbench proves a real W13/W2 win on our exact shapes. Do not import the framework runner. |
+| FlashInfer/TRTLLM/CuteDSL FP4 MoE backends use specialized weight/scale reorder | vLLM `experts/trtllm_mxfp4_moe.py`; vLLM `oracle/mxfp4.py`; SGLang `quantization/mxfp4.py`; SGLang `moe_runner/flashinfer_trtllm.py` | Not integrated. Current OpenInfer weights are per-expert tensors and TileLang grouped GEMM takes pointer arrays; FlashInfer routes expect different packed/reordered layouts and runner-level metadata | Candidate, but only after a standalone grouped-GEMM microbench proves a real W13/W2 win on our exact shapes. Do not import the framework runner. |
| DeepGEMM-style deeper epilogue fusion | vLLM `experts/deep_gemm_moe.py`; SGLang DeepGEMM benchmarks under `benchmark/kernels/deepseek` | Scalar upper-bound microbench shows exact feasibility but absolute standalone delta is tiny | Candidate only as true tensor-core W13 epilogue work. Standalone SwiGLU/quant substitutions are no longer enough. |
| FP4 quant before communication for high-throughput all-gather | SGLang `srt/layers/moe/utils.py::should_use_flashinfer_cutlass_moe_fp4_allgather` | Not adopted. Our current AG gathers BF16 hidden before routing; changing this means routing/dispatch protocol changes, not a local kernel swap | Future architecture work. Needs correctness design because router consumes hidden before expert dispatch. |
| High-throughput bs>100 packed MoE layout | vLLM/SGLang packed FP4/MXFP4 backends and dispatcher/finalizer layouts | Not part of the current sub-25 latency patch. Current per-expert tensors are good for low-latency iteration but probably not the final throughput layout | Future architecture work. Design W13/W2 weight layout, FP4 scale layout, dispatch row layout, and combine/finalize together; keep conversion offline/load-time and avoid two production hot paths. |
@@ -261,7 +261,7 @@ The later retained path uses the same MoE NCCL stream for the earlier hidden/tok
4. main stream waits on all-gather completion before router, local experts, and routed combine.
5. routed reduce-scatter still uses the MoE NCCL stream and the final add waits on its completion event.
-This does not change route math, grouped GEMM shape, or batch/expert generality. It is not copied directly from vLLM/SGLang operator code; it is a PegaInfer scheduling step that becomes available once the vLLM/SGLang-style local expert path is exact and stable. The first reduce-scatter-only fixed bench run moved to `26.77-26.80ms/token`; two repeats landed in the `27.64-27.99ms/token` band; the post full-overlap revert calibration landed at `28.38-28.54ms/token`. Moving MoE all-gather to the same NCCL stream and overlapping shared expert with all-gather produced fresh repeated fixed benches at `26.28-27.31ms/token`, still with token hash `6346f03343d75a65`. Keep decision: retain. This is safer than the rejected full shared-expert overlap because shared expert stays on the main compute stream; only MoE collectives move to the MoE NCCL stream.
+This does not change route math, grouped GEMM shape, or batch/expert generality. It is not copied directly from vLLM/SGLang operator code; it is an OpenInfer scheduling step that becomes available once the vLLM/SGLang-style local expert path is exact and stable. The first reduce-scatter-only fixed bench run moved to `26.77-26.80ms/token`; two repeats landed in the `27.64-27.99ms/token` band; the post full-overlap revert calibration landed at `28.38-28.54ms/token`. Moving MoE all-gather to the same NCCL stream and overlapping shared expert with all-gather produced fresh repeated fixed benches at `26.28-27.31ms/token`, still with token hash `6346f03343d75a65`. Keep decision: retain. This is safer than the rejected full shared-expert overlap because shared expert stays on the main compute stream; only MoE collectives move to the MoE NCCL stream.
### Fused Q/KV RoPE projection
@@ -366,7 +366,7 @@ Validation on 5090:
| Check | Result |
| --- | --- |
| `cargo fmt --check` | passed |
-| `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench run 1 | steady TPOT avg `33.330ms`, p50 `32.858ms`, p95 `35.274ms`, hash `6346f03343d75a65` |
| fixed bench run 2 | steady TPOT avg `34.289ms`, p50 `33.979ms`, p95 `36.852ms`, hash `6346f03343d75a65` |
@@ -388,7 +388,7 @@ blockIdx.x 0..15 -> W1 pointer arrays -> gate output
blockIdx.x 16..31 -> W3 pointer arrays -> up output
```
-The C++ tool `pegainfer-kernels/tools/deepseek_v4/w13_grouped_fp4_bench.cu` links the generated TileLang object directly and compares:
+The C++ tool `openinfer-kernels/tools/deepseek_v4/w13_grouped_fp4_bench.cu` links the generated TileLang object directly and compares:
```text
baseline: grouped_gemm(W1) + grouped_gemm(W3)
@@ -400,10 +400,10 @@ Fuzz uses BF16 random input, TileLang `act_quant_k4096`, random FP4 bytes and bo
Verified compile command shape:
```bash
-OUT_DIR=$(find target/release/build/pegainfer-kernels-* -maxdepth 1 -type d -name out -printf '%T@ %p\n' | sort -nr | head -1 | cut -d' ' -f2-)
+OUT_DIR=$(find target/release/build/openinfer-kernels-* -maxdepth 1 -type d -name out -printf '%T@ %p\n' | sort -nr | head -1 | cut -d' ' -f2-)
/usr/local/cuda/bin/nvcc -std=c++17 -O3 -arch=sm_120 \
-I/usr/local/cuda/include \
- pegainfer-kernels/tools/deepseek_v4/w13_grouped_fp4_bench.cu \
+ openinfer-kernels/tools/deepseek_v4/w13_grouped_fp4_bench.cu \
"$OUT_DIR/libkernels_cuda.a" \
-lcudart \
-o $RESULT_ROOT/w13_grouped_fp4_bench
@@ -441,7 +441,7 @@ Runtime validation on 5090:
| Check | Result |
| --- | --- |
| `cargo fmt --check` | passed |
-| `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench text run | steady TPOT avg `34.22ms`, p50 `33.77ms`, p95 `36.53ms`, first decode avg `32.94ms` |
| fixed bench JSON run | steady TPOT avg `31.986ms`, p50 `31.458ms`, p95 `34.052ms`, first decode avg `30.544ms`, hash `6346f03343d75a65` |
@@ -461,11 +461,11 @@ Reference source positions:
| SGLang | `$LOCAL_WORKSPACE/sglang/python/sglang/srt/layers/moe/moe_runner/deep_gemm.py` | `grouped_gemm_nt_f8f8bf16_masked` writes `gateup_output`, then `sglang_per_token_group_quant_8bit(..., fuse_silu_and_mul=True)`, then W2 grouped GEMM. |
| SGLang C++ quant | `$LOCAL_WORKSPACE/sglang/sgl-kernel/csrc/gemm/per_token_group_quant_8bit_v2.cu` | The `fuse_silu_and_mul` path fuses activation with group quant, including masked expert layout. |
-The next reusable lesson is their problem-size representation. vLLM builds `expert_offsets`, `blockscale_offsets`, `problem_sizes1`, and `problem_sizes2` before CUTLASS grouped GEMM. SGLang's masked path passes `masked_m` and `expected_m` into DeepGEMM. Both make the GEMM scheduler aware of per-expert logical M. PegaInfer currently has `expert_indptr`, but the TileLang grouped launch still uses `dim3 grid(out_tiles, ceil(rows / 32), local_experts)` and returns inside the kernel when `blockIdx.y * 32 >= expert_m`. That is correct and GPU-resident, but it still launches empty CTAs for short or empty experts.
+The next reusable lesson is their problem-size representation. vLLM builds `expert_offsets`, `blockscale_offsets`, `problem_sizes1`, and `problem_sizes2` before CUTLASS grouped GEMM. SGLang's masked path passes `masked_m` and `expected_m` into DeepGEMM. Both make the GEMM scheduler aware of per-expert logical M. OpenInfer currently has `expert_indptr`, but the TileLang grouped launch still uses `dim3 grid(out_tiles, ceil(rows / 32), local_experts)` and returns inside the kernel when `blockIdx.y * 32 >= expert_m`. That is correct and GPU-resident, but it still launches empty CTAs for short or empty experts.
The first active-tile design check found a launch-side constraint: a GPU-generated active tile list cannot by itself shrink the next CUDA launch because grid dimensions are chosen on the host. Using a device-side `active_tile_count` would require a D2H count, CUDA dynamic parallelism, or launching the original capacity grid and returning on `tile >= active_count`. The last option preserves correctness but not the desired launch reduction. A better target is the existing `local_count`: decode route mapping already computes the actual number of local routes on GPU, while runtime still carries `num_expanded = routed.seq_len * topk` (`8 * 6 = 48` for MP8 decode) through expand, activation quant, and grouped GEMM. The hard part is exploiting `local_count` without reintroducing route metadata D2H.
-Historical PegaInfer path before the retained fused W2 activation-quant work:
+Historical OpenInfer path before the retained fused W2 activation-quant work:
```text
act_quant(expanded_input)
@@ -520,7 +520,7 @@ Runtime validation on 5090:
| Check | Result |
| --- | --- |
| `cargo fmt --check` | passed |
-| `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON run 1 | steady TPOT avg `33.416ms`, p50 `32.884ms`, p95 `35.510ms`, first decode avg `31.885ms`, hash `6346f03343d75a65` |
| fixed bench JSON run 2 | steady TPOT avg `31.180ms`, p50 `30.675ms`, p95 `33.151ms`, first decode avg `30.020ms`, hash `6346f03343d75a65` |
@@ -559,9 +559,9 @@ Implementation notes:
| Check | Result |
| --- | --- |
| local `cargo fmt --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| 5090 `cargo fmt --check` | passed |
-| 5090 `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| 5090 `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON run 1 | steady TPOT avg `29.764ms`, p50 `29.296ms`, p95 `31.766ms`, first decode avg `28.575ms`, hash `6346f03343d75a65` |
| fixed bench JSON run 2 | steady TPOT avg `31.592ms`, p50 `31.082ms`, p95 `33.699ms`, first decode avg `30.019ms`, hash `6346f03343d75a65` |
@@ -600,7 +600,7 @@ These clears are semantic initialization, not removable allocation noise, but th
| Check | Result |
| --- | --- |
| local `cargo fmt --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON | `29.862ms`, `29.969ms`, `29.874ms`, all hash `6346f03343d75a65` |
| short nsys kernel summary | old `deepseek_moe_clear_i32_kernel` gone; `deepseek_moe_clear_mapping_kernel` appears once per mapping call |
@@ -625,7 +625,7 @@ For MP8 decode, `route_elems = global_batch * topk`; with the fixed single-reque
| Check | Result |
| --- | --- |
| local `cargo fmt --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON run 1 | `27.608ms`, `27.662ms`, `27.826ms`, all hash `6346f03343d75a65` |
| fixed bench JSON run 2 | `27.698ms`, `27.693ms`, `27.644ms`, all hash `6346f03343d75a65` |
@@ -656,7 +656,7 @@ This was exact-safe but not performance-safe. The route W13 kernel launched one
| --- | --- |
| local `cargo fmt --check` | passed |
| local `git diff --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| 5090 release build for `bench_serving` and `deepseek_v4_e2e` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON | aggregate steady TPOT avg `33.217ms`, p50 `33.003ms`, p95 `34.584ms`, decode throughput `30.115 tok/s` |
@@ -695,7 +695,7 @@ Runtime validation:
| --- | --- |
| local `cargo fmt --check` | passed |
| local `git diff --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| 5090 release build for `bench_serving` and `deepseek_v4_e2e` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench run 1 | aggregate steady TPOT avg `27.807ms`; iterations `28.080ms`, `28.143ms`, `27.198ms`; all hash `6346f03343d75a65` |
@@ -725,7 +725,7 @@ This preserved the no-D2H rule and kept W2 grouped GEMM semantics unchanged. It
| --- | --- |
| local `cargo fmt --check` | passed |
| local `git diff --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| 5090 release build for `bench_serving` and `deepseek_v4_e2e` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON | aggregate steady TPOT avg `31.270ms`, p50 `30.660ms`, p95 `33.575ms` |
@@ -735,7 +735,7 @@ Drop decision: do not retain. Skipping empty rows inside a tiny regular kernel i
### Rejected: shrink grouped GEMM row-tile launch by seq_len
-vLLM's CUTLASS path passes logical per-expert `problem_sizes`, while PegaInfer's TileLang grouped FP4 launch uses a host grid of:
+vLLM's CUTLASS path passes logical per-expert `problem_sizes`, while OpenInfer's TileLang grouped FP4 launch uses a host grid of:
```text
grid.x = output tiles
@@ -767,8 +767,8 @@ Runtime validation on 5090:
| --- | --- |
| local `cargo fmt --check` | passed |
| local `git diff --check` | passed |
-| local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
-| 5090 `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` | passed |
+| local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
+| 5090 `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` | passed |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench JSON | per-iteration steady TPOT avg `28.504ms`, `28.460ms`, `28.735ms`; all hash `6346f03343d75a65` |
@@ -945,7 +945,7 @@ Additional `32768` probes stayed bitwise:
| active `8`, rows/active `8` | `0.122960ms -> 0.108667ms` | `0.063493ms -> 0.057330ms` |
| active `16`, rows/active `4` | `0.315041ms -> 0.283986ms` | `0.122991ms -> 0.082393ms` |
-Runtime change: grouped FP4 W13 and W2 wrappers generated by `pegainfer-kernels/tools/tilelang/deepseek_v4/generate.py` now request `32768` dynamic shared bytes. Dense FP4/FP8 wrappers keep their existing requests.
+Runtime change: grouped FP4 W13 and W2 wrappers generated by `openinfer-kernels/tools/tilelang/deepseek_v4/generate.py` now request `32768` dynamic shared bytes. Dense FP4/FP8 wrappers keep their existing requests.
5090 validation:
@@ -1040,7 +1040,7 @@ Full-runtime validation:
| Check | Result |
| --- | --- |
-| release `cargo check -p pegainfer-deepseek-v4 --features deepseek-v4` | passed locally and on 5090 |
+| release `cargo check -p openinfer-deepseek-v4 --features deepseek-v4` | passed locally and on 5090 |
| release `deepseek_v4_e2e` | `All 20 DeepSeek V4 exact cases passed` |
| fixed bench run 1 | aggregate steady TPOT avg `28.971ms`; per-iteration `28.727ms`, `28.963ms`, `29.224ms`; all hash `6346f03343d75a65` |
| fixed bench repeat | aggregate steady TPOT avg `29.797ms`; per-iteration `29.913ms`, `29.764ms`, `29.713ms`; all hash `6346f03343d75a65` |
@@ -1135,7 +1135,7 @@ Implementation notes:
| Check | Result |
| --- | --- |
-| release `cargo check -p pegainfer-deepseek-v4 --features deepseek-v4` | passed locally and on 5090 after generator fixes |
+| release `cargo check -p openinfer-deepseek-v4 --features deepseek-v4` | passed locally and on 5090 after generator fixes |
| microbench first fuzz shape | active `1`, rows/active `8`, experts `32` |
| fuzz result | FAIL: FP8 activation and E8M0 scale bytes differed from the current baseline |
| log | `$RESULT_ROOT/dsv4_w13_swiglu_quant_bench.log` |
@@ -1145,7 +1145,7 @@ Drop decision: do not retain. The generator can express the rough shape, but it
Evidence required for each adoption step:
- vLLM/SGLang source location and whether we copied the decomposition, the kernel shape, or only the validation idea.
-- standalone microbench with fuzz against the current PegaInfer baseline.
+- standalone microbench with fuzz against the current OpenInfer baseline.
- exact E2E `20/20`.
- fixed JSON bench with token hash `6346f03343d75a65`.
- repeated TPOT range, not a single fast run.
@@ -1168,7 +1168,7 @@ Earlier exploratory runs of the same BF16 direct shape landed at `26.194ms`, `30
Rejected variant: caching score-gate weights as F32 preserved exact E2E and token hash, but the fixed bench regressed to aggregate steady TPOT avg `29.148ms` with per-iteration `29.152ms`, `29.139ms`, and `29.155ms`. The extra F32 memory footprint and F32 math path were not worth keeping.
-Rejected variant: direct CUDA BF16 router projection. SGLang has a `fused_moe_router_cudacore` route in `$LOCAL_WORKSPACE/sglang/python/sglang/srt/layers/moe/router.py`, and TileKernels has warp-level top-k/scoring kernels under `$LOCAL_WORKSPACE/TileKernels/tile_kernels/moe/`. We tested the analogous PegaInfer idea with a temporary standalone bench: keep the existing select/normalization semantics, but replace the cuBLAS BF16 projection with a direct CUDA dot-product kernel over `(seq_len, n_experts, hidden_dim)`. The bench source was deleted after rejection so it cannot be accidentally wired into runtime.
+Rejected variant: direct CUDA BF16 router projection. SGLang has a `fused_moe_router_cudacore` route in `$LOCAL_WORKSPACE/sglang/python/sglang/srt/layers/moe/router.py`, and TileKernels has warp-level top-k/scoring kernels under `$LOCAL_WORKSPACE/TileKernels/tile_kernels/moe/`. We tested the analogous OpenInfer idea with a temporary standalone bench: keep the existing select/normalization semantics, but replace the cuBLAS BF16 projection with a direct CUDA dot-product kernel over `(seq_len, n_experts, hidden_dim)`. The bench source was deleted after rejection so it cannot be accidentally wired into runtime.
5090 microbench:
@@ -1715,7 +1715,7 @@ Standalone tool:
-O3 \
-std=c++17 \
-arch=sm_120 \
- pegainfer-kernels/tools/deepseek_v4/score_select_bench.cu \
+ openinfer-kernels/tools/deepseek_v4/score_select_bench.cu \
-o $RESULT_ROOT/dsv4_score_select_bench
$RESULT_ROOT/dsv4_score_select_bench
@@ -1823,8 +1823,8 @@ Verified command set for this PR:
```bash
cargo fmt --check
-cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4
-cargo check --release -p pegainfer-server --features deepseek-v4
+cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4
+cargo check --release -p openinfer-server --features deepseek-v4
gcc -shared -fPIC -O2 -Wall -Wextra -o $RESULT_ROOT/cuda_api_counter.so tools/cuda_api_counter.c -ldl
```
@@ -1833,8 +1833,8 @@ gcc -shared -fPIC -O2 -Wall -Wextra -o $RESULT_ROOT/cuda_api_counter.so tools/cu
Local:
- `cargo fmt --check`
-- `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4`
-- `cargo check --release -p pegainfer-server --features deepseek-v4`
+- `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4`
+- `cargo check --release -p openinfer-server --features deepseek-v4`
- `gcc -shared -fPIC -O2 -Wall -Wextra -o $RESULT_ROOT/cuda_api_counter.so tools/cuda_api_counter.c -ldl`
- `nm -D $RESULT_ROOT/cuda_api_counter.so` confirmed base and `_ptsz` wrappers
- `git diff --check`
@@ -1843,8 +1843,8 @@ Local:
5090:
- `cargo fmt --check`
-- `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4`
-- `cargo check --release -p pegainfer-server --features deepseek-v4`
+- `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4`
+- `cargo check --release -p openinfer-server --features deepseek-v4`
- release `deepseek_v4_e2e`: `All 20 DeepSeek V4 exact cases passed`
- release fixed bench log `$RESULT_ROOT/dsv4_pr_driver_numa_bench.log`: steady TPOT avg `35.253ms`, p50 `34.800ms`, p95 `37.335ms`, first decode avg `33.743ms`, hash `6346f03343d75a65`
- current clean fixed bench log `$RESULT_ROOT/dsv4_clean_tpot_now.log`: per-iteration steady TPOT avg `29.944ms`, `29.907ms`, `29.896ms`, all hash `6346f03343d75a65`
@@ -1889,7 +1889,7 @@ Local:
- post-revert hand act_quant exact E2E log `$RESULT_ROOT/dsv4_act_quant_restored_e2e.log`: `All 20 DeepSeek V4 exact cases passed`
- post-revert hand act_quant fixed bench logs `$RESULT_ROOT/dsv4_act_quant_restored_bench.json` and `$RESULT_ROOT/dsv4_act_quant_restored_bench_repeat.json`: first run aggregate steady TPOT avg `28.249ms` but one iteration hash changed to `a278a8140c25b812`; repeat aggregate steady TPOT avg `29.277ms`, per-iteration `29.265ms`, `29.272ms`, `29.293ms`, with all hashes restored to `6346f03343d75a65`
- post-revert hand act_quant 5-run stability logs `$RESULT_ROOT/dsv4_stability_after_act_quant_revert_{1..5}.json`: aggregate steady TPOT avg `28.912ms`, `28.867ms`, `28.291ms`, `28.375ms`, and `28.715ms`; all 15 per-iteration hashes were `6346f03343d75a65`. Another CPU load was running during this sweep, so the result is a conservative sub-30 stability check rather than a clean machine best-band.
-- reviewer 5090 5-run stability rerun `$RESULT_ROOT/pegainfer_dev_pr101_bench_{1..5}.json`: aggregate steady TPOT avg `28.505793ms`, `28.087102ms`, `29.755957ms`, `27.552965ms`, and `29.371630ms`; all 15 per-iteration hashes were `6346f03343d75a65`. One run wrote the complete JSON report and logged scheduler exit, then segfaulted in NCCL shutdown; treat that as the existing shutdown cleanup issue, not decode TPOT or token-correctness evidence.
+- reviewer 5090 5-run stability rerun `$RESULT_ROOT/openinfer_dev_pr101_bench_{1..5}.json`: aggregate steady TPOT avg `28.505793ms`, `28.087102ms`, `29.755957ms`, `27.552965ms`, and `29.371630ms`; all 15 per-iteration hashes were `6346f03343d75a65`. One run wrote the complete JSON report and logged scheduler exit, then segfaulted in NCCL shutdown; treat that as the existing shutdown cleanup issue, not decode TPOT or token-correctness evidence.
- fused Q/KV RoPE exact E2E log `$RESULT_ROOT/dsv4_qkv_rope_e2e.log`: `All 20 DeepSeek V4 exact cases passed`
- fused Q/KV RoPE fixed bench logs `$RESULT_ROOT/dsv4_qkv_rope_bench.log` and `$RESULT_ROOT/dsv4_qkv_rope_bench_repeat.log`: per-iteration steady TPOT avg `28.215ms`, `28.256ms`, `28.236ms`, then `27.096ms`, `28.565ms`, `28.349ms`; all hash `6346f03343d75a65`
- fused Q/KV RoPE short profile `$RESULT_ROOT/dsv4_qkv_rope_short.nsys-rep` and `$RESULT_ROOT/dsv4_qkv_rope_short_kernels_cuda_gpu_kern_sum.csv`: `deepseek_apply_rope_q_kv_kernel` appears in the kernel summary; residual hidden-RoPE kernels are from non-projection paths.
@@ -1898,7 +1898,7 @@ Local:
- rejected ratio-4 top-k concat removal exact E2E log `$RESULT_ROOT/dsv4_topk_no_concat_e2e.log`: `All 20 DeepSeek V4 exact cases passed`
- rejected ratio-4 top-k concat removal fixed bench log `$RESULT_ROOT/dsv4_topk_no_concat_bench.log`: aggregate steady TPOT avg `29.541ms`, per-iteration `29.551ms`, `29.539ms`, `29.532ms`; all hash `6346f03343d75a65`
- post-revert ratio-4 top-k fixed bench log `$RESULT_ROOT/dsv4_revert_topk_bench.log`: aggregate steady TPOT avg `28.333ms`, per-iteration `28.316ms`, `28.336ms`, `28.346ms`; all hash `6346f03343d75a65`
-- old split MoE/SwiGLU cleanup: local `cargo fmt --check`, local `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4`, local `git diff --check`, 5090 `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4`, and 5090 release builds for `bench_serving` and `deepseek_v4_e2e` passed after removing stale public/FFI exports.
+- old split MoE/SwiGLU cleanup: local `cargo fmt --check`, local `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4`, local `git diff --check`, 5090 `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4`, and 5090 release builds for `bench_serving` and `deepseek_v4_e2e` passed after removing stale public/FFI exports.
- old split MoE/SwiGLU cleanup exact E2E log `$RESULT_ROOT/dsv4_fused_cleanup_e2e.log`: `All 20 DeepSeek V4 exact cases passed`
- old split MoE/SwiGLU cleanup fixed bench log `$RESULT_ROOT/dsv4_fused_cleanup_bench.log`: aggregate steady TPOT avg `27.860ms`, per-iteration `27.863ms`, `27.845ms`, `27.872ms`; all hash `6346f03343d75a65`
- MoE reduce-scatter/shared overlap exact E2E log `$RESULT_ROOT/dsv4_moe_rs_overlap_e2e.log`: `All 20 DeepSeek V4 exact cases passed`
@@ -1939,7 +1939,7 @@ Local:
- rejected naive grouped FP4 `block_M=16` logs `$RESULT_ROOT/dsv4_w13_block_m16_bench.log` and `$RESULT_ROOT/dsv4_w13_block_m16_large_rows_bench.log`: decode-like small rows sped up, but rows/active `32` failed fuzz because grouped transforms/wrappers still have hard-coded `32`-row assumptions; not retained.
- rejected parameterized grouped FP4 `block_M=16` logs `$RESULT_ROOT/dsv4_w13_block_m16_param_fuzz.log`, `$RESULT_ROOT/dsv4_grouped_block_m16_e2e.log`, `$RESULT_ROOT/dsv4_grouped_block_m16_bench.log`, and `$RESULT_ROOT/dsv4_grouped_block_m16_bench_repeat.log`: broad fuzz and exact E2E passed, token hash stayed `6346f03343d75a65`, but fixed bench regressed to aggregate steady TPOT avg `28.971ms` then `29.797ms`; local and 5090 were restored to grouped FP4 `block_M=32`.
- post-restore grouped FP4 fixed bench log `$RESULT_ROOT/dsv4_grouped_block_m16_restored_bench.log`: aggregate steady TPOT avg `28.736ms`, per-iteration `28.445ms`, `28.998ms`, `28.763ms`; all hash `6346f03343d75a65`.
-- completion audit and cleanup: local `git diff --check`, `cargo fmt --check`, and `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` passed after documenting the sub-25 gap and deleting untracked rejected bench sources. The retained tool sources are `score_select_bench.cu`, `swiglu_quant_bench.cu`, and `w13_grouped_fp4_bench.cu`.
+- completion audit and cleanup: local `git diff --check`, `cargo fmt --check`, and `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` passed after documenting the sub-25 gap and deleting untracked rejected bench sources. The retained tool sources are `score_select_bench.cu`, `swiglu_quant_bench.cu`, and `w13_grouped_fp4_bench.cu`.
- vLLM/SGLang large-batch gap audit: source inspection confirmed the mature FP4 MoE throughput path combines static W13/W2 weight reorder, FP4 scale interleave, packed routed top-k, and problem-size-aware grouped backends. This supports keeping packed MoE layout as a separate bs>100 architecture project rather than mixing it into the current sub-25 latency patch.
- `gcc -shared -fPIC -O2 -Wall -Wextra -o $RESULT_ROOT/cuda_api_counter.so tools/cuda_api_counter.c -ldl`
- `nm -D $RESULT_ROOT/cuda_api_counter.so` confirmed base and `_ptsz` wrappers
diff --git a/docs/models/deepseek-v4/http-serving-benchmark.md b/docs/models/deepseek-v4/http-serving-benchmark.md
index 2a079667..49b3d609 100644
--- a/docs/models/deepseek-v4/http-serving-benchmark.md
+++ b/docs/models/deepseek-v4/http-serving-benchmark.md
@@ -34,21 +34,21 @@ requests:
Build the server on the target GPU host:
```bash
-cd /path/to/pegainfer
+cd /path/to/openinfer
export PATH=/usr/local/cuda-13.1/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.1
-export PEGAINFER_TILELANG_PYTHON=/path/to/venv/bin/python
-export PEGAINFER_TRITON_PYTHON=/path/to/venv/bin/python
-export PEGAINFER_NVCC_JOBS=8
-export CARGO_TARGET_DIR=/path/to/pegainfer-target
+export OPENINFER_TILELANG_PYTHON=/path/to/venv/bin/python
+export OPENINFER_TRITON_PYTHON=/path/to/venv/bin/python
+export OPENINFER_NVCC_JOBS=8
+export CARGO_TARGET_DIR=/path/to/openinfer-target
-cargo build --release -p pegainfer-server --features deepseek-v4 --bin pegainfer
+cargo build --release -p openinfer-server --features deepseek-v4 --bin openinfer
```
Start the OpenAI-compatible HTTP endpoint:
```bash
-$CARGO_TARGET_DIR/release/pegainfer \
+$CARGO_TARGET_DIR/release/openinfer \
--model-path $MODEL_DIR \
--port 18118 2>&1 | tee $RESULT_ROOT/dsv4_http_server.log
```
@@ -56,7 +56,7 @@ $CARGO_TARGET_DIR/release/pegainfer \
For prefill phase attribution, start the endpoint with profiling enabled:
```bash
-$CARGO_TARGET_DIR/release/pegainfer \
+$CARGO_TARGET_DIR/release/openinfer \
--model-path $MODEL_DIR \
--port 18118 \
--deepseek-prefill-profile 2>&1 | tee $RESULT_ROOT/dsv4_http_server_profile.log
@@ -125,8 +125,8 @@ The script is intentionally model-server agnostic at the HTTP layer. It only
requires an OpenAI-compatible `/v1/completions` endpoint that supports streaming
responses.
-The server trace columns are pegainfer-specific and require a pegainfer server
-log containing `pegainfer_http_trace` lines. The sweep fails when any cell has
+The server trace columns are openinfer-specific and require an openinfer server
+log containing `openinfer_http_trace` lines. The sweep fails when any cell has
request failures/timeouts or per-request output hashes that change across
repeats.
@@ -357,7 +357,7 @@ top-k as the largest remaining indexer-side bucket:
The equivalence gate is a GPU test against the current selector semantics:
```bash
-cargo test --release -p pegainfer-kernels \
+cargo test --release -p openinfer-kernels \
--features deepseek-v4 \
--test deepseek_indexer_topk -- --ignored --nocapture
```
diff --git a/docs/models/deepseek-v4/kernel-paths.md b/docs/models/deepseek-v4/kernel-paths.md
index e581719c..fbae16d1 100644
--- a/docs/models/deepseek-v4/kernel-paths.md
+++ b/docs/models/deepseek-v4/kernel-paths.md
@@ -8,22 +8,22 @@
- **Read**:
- `docs/index.md` - showed DeepSeek V4 support, kernel boundary, and Qwen3 kernel extraction as the relevant prior work.
- `docs/models/deepseek-v4/support.md` - confirmed DeepSeek V4 currently has native MP8 runtime, TileLang build-time kernels, exact E2E coverage, and a documented CUDA split by subsystem.
- - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed kernels belong in the shared kernels crate, while model DAG/runtime policy stays in the model crate.
- - `docs/models/qwen3/kernels-crate.md` - established the existing crate-first split and the role of `pegainfer-kernels/KERNELS.md`.
+ - `docs/subsystems/kernels/openinfer-kernels-boundary.md` - confirmed kernels belong in the shared kernels crate, while model DAG/runtime policy stays in the model crate.
+ - `docs/models/qwen3/kernels-crate.md` - established the existing crate-first split and the role of `openinfer-kernels/KERNELS.md`.
- `docs/conventions/coding-style.md` - reminded that GPU kernels deserve targeted tests, while broad behavior is better covered by integration/E2E.
- - `pegainfer-kernels/build.rs` - showed DeepSeek kernels are feature-gated by filename prefix in a flat `csrc/` scan, and TileLang generation was hard-coded to the old flat `tools/tilelang/gen_deepseek_v4_tilelang.py` path.
- - `pegainfer-kernels/KERNELS.md` - currently indexes Qwen3 and only mentions DeepSeek as compatibility symbols, so DSV4 has no routing table.
- - `pegainfer-kernels/csrc/deepseek_*.cu` and `pegainfer-kernels/csrc/deepseek_common.cuh` - confirmed the CUDA side is already split by subsystem but still lives in the root kernel source directory.
- - `pegainfer-deepseek-v4/src/runtime/*` - confirmed runtime calls reach DeepSeek symbols through `pegainfer_kernels::ffi`, so path cleanup should not require runtime API changes.
+ - `openinfer-kernels/build.rs` - showed DeepSeek kernels are feature-gated by filename prefix in a flat `csrc/` scan, and TileLang generation was hard-coded to the old flat `tools/tilelang/gen_deepseek_v4_tilelang.py` path.
+ - `openinfer-kernels/KERNELS.md` - currently indexes Qwen3 and only mentions DeepSeek as compatibility symbols, so DSV4 has no routing table.
+ - `openinfer-kernels/csrc/deepseek_*.cu` and `openinfer-kernels/csrc/deepseek_common.cuh` - confirmed the CUDA side is already split by subsystem but still lives in the root kernel source directory.
+ - `openinfer-deepseek-v4/src/runtime/*` - confirmed runtime calls reach DeepSeek symbols through `openinfer_kernels::ffi`, so path cleanup should not require runtime API changes.
- **Relevant history**:
- `docs/models/deepseek-v4/support.md` records that the current DeepSeek CUDA glue is intentionally split by subsystem; this cleanup should preserve that split instead of merging files.
- - `docs/models/qwen3/kernels-crate.md` moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
+ - `docs/models/qwen3/kernels-crate.md` moved kernel ownership into `openinfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
- **Plan**:
- 1. First slice: move DeepSeek V4 CUDA sources from `pegainfer-kernels/csrc/deepseek_*.cu` and `deepseek_common.cuh` into `pegainfer-kernels/csrc/deepseek_v4/`, then update `pegainfer-kernels/build.rs` to discover CUDA files recursively and feature-gate DeepSeek by path instead of flat filename prefix.
+ 1. First slice: move DeepSeek V4 CUDA sources from `openinfer-kernels/csrc/deepseek_*.cu` and `deepseek_common.cuh` into `openinfer-kernels/csrc/deepseek_v4/`, then update `openinfer-kernels/build.rs` to discover CUDA files recursively and feature-gate DeepSeek by path instead of flat filename prefix.
2. Keep object file names stable or explicitly namespace them so `ar` input names remain collision-free when sources live in subdirectories.
3. Update include/rerun handling so `.cu` and `.cuh` changes under nested kernel directories trigger rebuilds.
- 4. Run low-cost verification for the first slice: `cargo fmt --all --check`, `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-kernels --features deepseek-v4`, and the non-DeepSeek default check if local CUDA/TileLang availability permits.
- 5. Record the result in this doc, then decide the next slice: likely moving the TileLang generator into a DeepSeek-specific tools path and adding a DSV4 section to `pegainfer-kernels/KERNELS.md`.
+ 4. Run low-cost verification for the first slice: `cargo fmt --all --check`, `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-kernels --features deepseek-v4`, and the non-DeepSeek default check if local CUDA/TileLang availability permits.
+ 5. Record the result in this doc, then decide the next slice: likely moving the TileLang generator into a DeepSeek-specific tools path and adding a DSV4 section to `openinfer-kernels/KERNELS.md`.
- **Risks / open questions**:
- Recursive source discovery can accidentally compile generated or third-party CUDA if scoped too broadly. It should only recurse under owned `csrc/`.
- DeepSeek TileLang requires a working TileLang Python; local verification may stop at environment setup rather than code correctness.
@@ -32,48 +32,48 @@
## Execution Log
### Step 1: Move DeepSeek V4 CUDA sources under a model-specific directory
-- Moved DeepSeek V4 CUDA sources from `pegainfer-kernels/csrc/deepseek_*.cu` and `pegainfer-kernels/csrc/deepseek_common.cuh` into `pegainfer-kernels/csrc/deepseek_v4/`.
-- Updated `pegainfer-kernels/build.rs` to collect owned `csrc/` files recursively, emit rebuild triggers for nested `.cu`/`.cuh` files, and generate object names from the relative source path so nested CUDA files do not collide with flat ones.
-- Replaced the build-script feature probe with `cfg!(feature = "deepseek-v4")`. Cargo feature resolution was checked with `cargo tree -p pegainfer-server --features deepseek-v4 -i pegainfer-kernels -e features`, which confirmed `pegainfer-server/deepseek-v4` enables `pegainfer-deepseek-v4/deepseek-v4` and then `pegainfer-kernels/deepseek-v4`.
-- Updated `pegainfer-kernels/KERNELS.md` and `docs/models/deepseek-v4/support.md` to point DeepSeek CUDA references at `csrc/deepseek_v4/`.
+- Moved DeepSeek V4 CUDA sources from `openinfer-kernels/csrc/deepseek_*.cu` and `openinfer-kernels/csrc/deepseek_common.cuh` into `openinfer-kernels/csrc/deepseek_v4/`.
+- Updated `openinfer-kernels/build.rs` to collect owned `csrc/` files recursively, emit rebuild triggers for nested `.cu`/`.cuh` files, and generate object names from the relative source path so nested CUDA files do not collide with flat ones.
+- Replaced the build-script feature probe with `cfg!(feature = "deepseek-v4")`. Cargo feature resolution was checked with `cargo tree -p openinfer-server --features deepseek-v4 -i openinfer-kernels -e features`, which confirmed `openinfer-server/deepseek-v4` enables `openinfer-deepseek-v4/deepseek-v4` and then `openinfer-kernels/deepseek-v4`.
+- Updated `openinfer-kernels/KERNELS.md` and `docs/models/deepseek-v4/support.md` to point DeepSeek CUDA references at `csrc/deepseek_v4/`.
Result: path move and build-script feature gating are in place.
Verification:
- `cargo fmt --all --check` passed.
-- `cargo tree -p pegainfer-server --features deepseek-v4 -i pegainfer-kernels -e features` confirmed feature forwarding from server to DeepSeek V4 model crate to kernels.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-kernels` passed. The build log confirmed DeepSeek V4 CUDA/TileLang kernels are disabled without the feature.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-kernels --features deepseek-v4` passed. The build log confirmed DeepSeek V4 TileLang CUDA generation under `target/.../tilelang/deepseek_v4/`.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-server --features deepseek-v4` passed, covering feature forwarding through the server, model crate, and kernels crate together.
+- `cargo tree -p openinfer-server --features deepseek-v4 -i openinfer-kernels -e features` confirmed feature forwarding from server to DeepSeek V4 model crate to kernels.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-kernels` passed. The build log confirmed DeepSeek V4 CUDA/TileLang kernels are disabled without the feature.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-kernels --features deepseek-v4` passed. The build log confirmed DeepSeek V4 TileLang CUDA generation under `target/.../tilelang/deepseek_v4/`.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-server --features deepseek-v4` passed, covering feature forwarding through the server, model crate, and kernels crate together.
### Step 2: Move the DeepSeek V4 TileLang generator under the shared TileLang backend directory
-- Moved `pegainfer-kernels/tools/tilelang/gen_deepseek_v4_tilelang.py` to `pegainfer-kernels/tools/tilelang/deepseek_v4/generate.py`.
-- Updated `pegainfer-kernels/build.rs` to run the generator from the new path.
+- Moved `openinfer-kernels/tools/tilelang/gen_deepseek_v4_tilelang.py` to `openinfer-kernels/tools/tilelang/deepseek_v4/generate.py`.
+- Updated `openinfer-kernels/build.rs` to run the generator from the new path.
- Updated the generated CUDA banner comment to point at the new generator path.
-- Added `pegainfer-kernels/tools/tilelang/README.md` to define `tools/tilelang/` as the shared TileLang backend directory, with model- or shape-family-specific generators in subdirectories.
+- Added `openinfer-kernels/tools/tilelang/README.md` to define `tools/tilelang/` as the shared TileLang backend directory, with model- or shape-family-specific generators in subdirectories.
- Updated `docs/models/deepseek-v4/support.md` to point at the new generator path.
Verification:
- `cargo fmt --all --check` passed.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-kernels --features deepseek-v4` passed. The build log showed DeepSeek V4 TileLang CUDA generation still succeeds after the path move.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-server --features deepseek-v4` passed after the generator move.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-kernels --features deepseek-v4` passed. The build log showed DeepSeek V4 TileLang CUDA generation still succeeds after the path move.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-server --features deepseek-v4` passed after the generator move.
### Step 3: Add the DeepSeek V4 kernel routing index
-- Added a `DeepSeek V4 MP8 Path` section to `pegainfer-kernels/KERNELS.md`.
-- Mapped runtime owners under `pegainfer-deepseek-v4/src/runtime/` to the public `pegainfer_kernels::ffi` symbols and their CUDA/TileLang source owners.
+- Added a `DeepSeek V4 MP8 Path` section to `openinfer-kernels/KERNELS.md`.
+- Mapped runtime owners under `openinfer-deepseek-v4/src/runtime/` to the public `openinfer_kernels::ffi` symbols and their CUDA/TileLang source owners.
- Grouped rows by execution subsystem rather than every individual shape: quant, attention, collectives cast helpers, indexer, compressor, HC, logits, and MoE.
- Kept TileLang shape details in source notes so the table remains a routing aid rather than a duplicate ABI declaration.
Verification:
-- `rg` over `pegainfer-kernels/src/ffi.rs`, `pegainfer-kernels/csrc/deepseek_v4/`, and `pegainfer-deepseek-v4/src/runtime/` was used to build the mapping.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-kernels` passed without `deepseek-v4`; the build log confirmed DeepSeek V4 CUDA/TileLang kernels are disabled on the default path.
-- `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-server` passed without `deepseek-v4`; the build log again confirmed the default server path skips DeepSeek V4 CUDA/TileLang.
+- `rg` over `openinfer-kernels/src/ffi.rs`, `openinfer-kernels/csrc/deepseek_v4/`, and `openinfer-deepseek-v4/src/runtime/` was used to build the mapping.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-kernels` passed without `deepseek-v4`; the build log confirmed DeepSeek V4 CUDA/TileLang kernels are disabled on the default path.
+- `OPENINFER_CUDA_SM=120 cargo check --release -p openinfer-server` passed without `deepseek-v4`; the build log again confirmed the default server path skips DeepSeek V4 CUDA/TileLang.
## Debrief
-- **Outcome**: DeepSeek V4 owned CUDA sources now live under `pegainfer-kernels/csrc/deepseek_v4/`. The DeepSeek V4 TileLang generator now lives under the shared TileLang backend directory at `pegainfer-kernels/tools/tilelang/deepseek_v4/generate.py`. The kernels build script recursively scans owned CUDA sources, skips DSV4 by path when `deepseek-v4` is disabled, uses `cfg!(feature = "deepseek-v4")` for the feature decision, namespaces object names by relative source path, and runs the TileLang generator from its new path. `pegainfer-kernels/KERNELS.md` now includes a DeepSeek V4 MP8 routing table from runtime owners to FFI symbols and CUDA/TileLang source paths.
+- **Outcome**: DeepSeek V4 owned CUDA sources now live under `openinfer-kernels/csrc/deepseek_v4/`. The DeepSeek V4 TileLang generator now lives under the shared TileLang backend directory at `openinfer-kernels/tools/tilelang/deepseek_v4/generate.py`. The kernels build script recursively scans owned CUDA sources, skips DSV4 by path when `deepseek-v4` is disabled, uses `cfg!(feature = "deepseek-v4")` for the feature decision, namespaces object names by relative source path, and runs the TileLang generator from its new path. `openinfer-kernels/KERNELS.md` now includes a DeepSeek V4 MP8 routing table from runtime owners to FFI symbols and CUDA/TileLang source paths.
- **Pitfalls encountered**:
- - The initial feature probe used `CARGO_FEATURE_DEEPSEEK_V4`; Cargo already forwards the feature into `pegainfer-kernels`, so `cfg!(feature = "deepseek-v4")` is clearer in `build.rs`.
+ - The initial feature probe used `CARGO_FEATURE_DEEPSEEK_V4`; Cargo already forwards the feature into `openinfer-kernels`, so `cfg!(feature = "deepseek-v4")` is clearer in `build.rs`.
- `cargo tree -e features` needs the reverse-dependency form (`-i openinfer-kernels`) to show the exact feature forwarding path clearly.
- **Lessons learned**:
- Moving model-owned kernel source into a subdirectory is low-risk once build discovery is path-based rather than filename-prefix based.
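The path-based discovery described in this task (recursive `csrc/` scan, skipping DeepSeek V4 sources by path, and object names derived from the relative source path) might look roughly like the sketch below. It is an illustration under assumptions, not the actual `openinfer-kernels/build.rs`: the helper names `collect_cuda_sources`, `object_name`, `is_deepseek_v4_source`, and `select_sources`, and the `__` separator, are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Component, Path, PathBuf};

/// Recursively collect `.cu` files under `root` (hypothetical helper).
fn collect_cuda_sources(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_cuda_sources(&path, out)?;
        } else if path.extension().and_then(|e| e.to_str()) == Some("cu") {
            out.push(path);
        }
    }
    Ok(())
}

/// Derive an object name from the path relative to `root`, so that
/// `deepseek_v4/foo.cu` and a flat `foo.cu` map to different objects.
/// The `__` separator is an assumption made for this sketch.
fn object_name(root: &Path, source: &Path) -> String {
    let rel = source.strip_prefix(root).unwrap_or(source).with_extension("");
    let parts: Vec<String> = rel
        .components()
        .filter_map(|c| match c {
            Component::Normal(s) => Some(s.to_string_lossy().into_owned()),
            _ => None,
        })
        .collect();
    format!("{}.o", parts.join("__"))
}

/// Path-based ownership check: a source counts as DeepSeek V4 owned only if
/// its first relative path component is the `deepseek_v4` directory.
fn is_deepseek_v4_source(root: &Path, source: &Path) -> bool {
    source
        .strip_prefix(root)
        .map(|rel| rel.starts_with("deepseek_v4"))
        .unwrap_or(false)
}

/// Keep every source when the feature is on; otherwise drop DeepSeek V4 ones.
fn select_sources(root: &Path, all: &[PathBuf], deepseek_v4_enabled: bool) -> Vec<PathBuf> {
    all.iter()
        .filter(|p| deepseek_v4_enabled || !is_deepseek_v4_source(root, p))
        .cloned()
        .collect()
}
```

In a build script, `deepseek_v4_enabled` would presumably be `cfg!(feature = "deepseek-v4")`, and each collected file would also be reported with a `cargo:rerun-if-changed=<path>` line. Because the check compares whole path components, a flat file whose name merely starts with `deepseek_v4` is not treated as model-owned, which is the difference between path-based and filename-prefix discovery.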
diff --git a/docs/models/deepseek-v4/moe-ag-rs.md b/docs/models/deepseek-v4/moe-ag-rs.md
index d2ef63b7..9b08a30c 100644
--- a/docs/models/deepseek-v4/moe-ag-rs.md
+++ b/docs/models/deepseek-v4/moe-ag-rs.md
@@ -20,17 +20,17 @@ Decode MoE now uses GPU-resident allgather/router/local-expert/reduce-scatter fl
1. Keep production prefill group helpers, because `prefill_logits_and_decode_cache_group_bf16_hidden` is still called by the direct runtime.
2. Remove decode-only group entry points that support single-thread multi-rank decode: `block_decode_group_bf16_hidden`, `block_decode_group_rank_threads_bf16_hidden`, and their now-unused attention/MoE group helpers.
3. Remove public re-exports and mp8 manifest tests that only exercise the deleted decode group path.
- 4. Run `cargo fmt --check` and `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4`.
+ 4. Run `cargo fmt --check` and `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4`.
- **Risks / open questions**:
- Some tests currently use group decode as a small two-rank smoke path; deleting them narrows coverage to production rank-lane decode plus prefill group tests.
### Eliminate central single-thread direct runtime paths
- **Read**:
- - `pegainfer-deepseek-v4/src/direct.rs` - confirmed decode already uses persistent rank workers, but cache ownership is still moved through central runtime every token and prefill still uses central group helpers.
- - `pegainfer-deepseek-v4/src/runtime/block.rs` - identified rank-local decode structure and central prefill group cache seeding path.
- - `pegainfer-deepseek-v4/src/runtime/attention.rs` - identified ratio-4 prefill indexer all-reduce as the non-trivial collective that needs a rank-lane version.
- - `pegainfer-deepseek-v4/src/runtime/moe.rs` - identified prefill MoE routed all-reduce as another group-only collective to move into rank lanes.
+ - `openinfer-deepseek-v4/src/direct.rs` - confirmed decode already uses persistent rank workers, but cache ownership is still moved through central runtime every token and prefill still uses central group helpers.
+ - `openinfer-deepseek-v4/src/runtime/block.rs` - identified rank-local decode structure and central prefill group cache seeding path.
+ - `openinfer-deepseek-v4/src/runtime/attention.rs` - identified ratio-4 prefill indexer all-reduce as the non-trivial collective that needs a rank-lane version.
+ - `openinfer-deepseek-v4/src/runtime/moe.rs` - identified prefill MoE routed all-reduce as another group-only collective to move into rank lanes.
- **Relevant history**:
- Step 6 removed decode group entry points, but prefill/cache ownership still left a central multi-rank path that future changes could accidentally follow.
- **Plan**:
@@ -49,9 +49,9 @@ Decode MoE now uses GPU-resident allgather/router/local-expert/reduce-scatter fl
- **Relevant history**:
- `docs/models/deepseek-v4/moe-tilelang-review.md` records that replacing local expert execution is a larger cutover; this task intentionally only adds the regular collective exchange primitives.
- **Plan**:
- 1. Add `all_gather_bf16_hidden_group` and `reduce_scatter_f32_hidden_group` in `pegainfer-deepseek-v4/src/runtime/collectives.rs` with explicit shape checks.
- 2. Export the new collectives from `pegainfer-deepseek-v4/src/lib.rs`.
- 3. Add a focused NCCL pair test in `pegainfer-deepseek-v4/tests/mp8_manifest.rs` that validates `[B_local,H] -> [world*B_local,H]` allgather and `[world*B_local,H] -> [B_local,H]` f32 reduce-scatter.
+ 1. Add `all_gather_bf16_hidden_group` and `reduce_scatter_f32_hidden_group` in `openinfer-deepseek-v4/src/runtime/collectives.rs` with explicit shape checks.
+ 2. Export the new collectives from `openinfer-deepseek-v4/src/lib.rs`.
+ 3. Add a focused NCCL pair test in `openinfer-deepseek-v4/tests/mp8_manifest.rs` that validates `[B_local,H] -> [world*B_local,H]` allgather and `[world*B_local,H] -> [B_local,H]` f32 reduce-scatter.
4. Run the targeted test or compile check with release settings.
- **Risks / open questions**:
- The pair test requires two GPUs and a loadable NCCL runtime; it should skip cleanly when NCCL is unavailable, matching the existing all-reduce test.
@@ -59,30 +59,30 @@ Decode MoE now uses GPU-resident allgather/router/local-expert/reduce-scatter fl
## Execution Log
### Step 6: Remove legacy decode group path
-- Removed decode-only single-thread/multi-rank entry points from `pegainfer-deepseek-v4/src/runtime/block.rs`:
+- Removed decode-only single-thread/multi-rank entry points from `openinfer-deepseek-v4/src/runtime/block.rs`:
- `block_decode_group_bf16_hidden`
- `block_decode_group_rank_threads_bf16_hidden`
- Removed decode group helpers that only existed for those entry points:
- - attention decode group wrappers in `pegainfer-deepseek-v4/src/runtime/attention.rs`
- - `decode_moe_ag_rs_group_bf16_hidden` in `pegainfer-deepseek-v4/src/runtime/moe.rs`
+ - attention decode group wrappers in `openinfer-deepseek-v4/src/runtime/attention.rs`
+ - `decode_moe_ag_rs_group_bf16_hidden` in `openinfer-deepseek-v4/src/runtime/moe.rs`
- AG/RS group collectives `all_gather_bf16_hidden_group`, `all_gather_u32_group`, and `reduce_scatter_f32_hidden_group`
- Removed public re-exports and mp8 manifest tests that referenced the deleted decode group path.
- Kept production prefill group helpers, because direct prefill still calls `prefill_logits_and_decode_cache_group_bf16_hidden`.
- Verification:
- `cargo fmt`
- `cargo fmt --check` passed
- - `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` passed
- - `rg -n "block_decode_group|group_rank_threads|attention_decode_group|decode_moe_ag_rs_group|all_gather_bf16_hidden_group|reduce_scatter_f32_hidden_group|all_gather_u32_group" pegainfer-deepseek-v4/src pegainfer-deepseek-v4/tests` returned no matches
+ - `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` passed
+ - `rg -n "block_decode_group|group_rank_threads|attention_decode_group|decode_moe_ag_rs_group|all_gather_bf16_hidden_group|reduce_scatter_f32_hidden_group|all_gather_u32_group" openinfer-deepseek-v4/src openinfer-deepseek-v4/tests` returned no matches
### Step 7: Remote exact E2E after cleanup
-- Synced the cleanup files back to `5090:$PEGAINFER_DIR`.
+- Synced the cleanup files back to `5090:$OPENINFER_DIR`.
- Verified model path on 5090: `$MODEL_DIR`.
- Ran on 5090:
```bash
source ~/.cargo/env 2>/dev/null || true
-cd $PEGAINFER_DIR
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
+cd $OPENINFER_DIR
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
```
- Result:
@@ -105,23 +105,23 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features de
- mp8 tests that exercised the deleted group path
- Verification:
- `cargo fmt` passed locally
- - `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` passed locally and on 5090
- - `cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest --no-run` passed locally
- - `rg -n "group_start|group_end|all_reduce_hidden_group|all_gather_logits_group|embedding_vocab_parallel_group|final_logits_group_bf16_hidden|hash_routed_moe_group_bf16_hidden|moe_group_bf16_hidden|attention_prefill_.*group|block_prefill_group|prefill_logits_group|prefill_logits_and_decode_cache_group|deepseek_mp8_check|contexts: Vec" pegainfer-deepseek-v4/src pegainfer-deepseek-v4/tests pegainfer-deepseek-v4/Cargo.toml` returned no matches locally
+ - `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` passed locally and on 5090
+ - `cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest --no-run` passed locally
+ - `rg -n "group_start|group_end|all_reduce_hidden_group|all_gather_logits_group|embedding_vocab_parallel_group|final_logits_group_bf16_hidden|hash_routed_moe_group_bf16_hidden|moe_group_bf16_hidden|attention_prefill_.*group|block_prefill_group|prefill_logits_group|prefill_logits_and_decode_cache_group|deepseek_mp8_check|contexts: Vec" openinfer-deepseek-v4/src openinfer-deepseek-v4/tests openinfer-deepseek-v4/Cargo.toml` returned no matches locally
- 5090 exact E2E with `$MODEL_DIR` passed: `All 20 DeepSeek V4 exact cases passed`
### Step 9: Split direct scheduler and worker files
-- Split the former monolithic `pegainfer-deepseek-v4/src/direct.rs` into:
- - `pegainfer-deepseek-v4/src/direct.rs` as a thin module/re-export facade.
- - `pegainfer-deepseek-v4/src/direct/scheduler.rs` for request validation, the single-request greedy scheduler loop, token event emission, and sampling.
- - `pegainfer-deepseek-v4/src/direct/worker.rs` for rank worker commands, rank resource ownership, cache/RoPE management, per-rank prefill/decode execution, and rank-0 logits collection.
+- Split the former monolithic `openinfer-deepseek-v4/src/direct.rs` into:
+ - `openinfer-deepseek-v4/src/direct.rs` as a thin module/re-export facade.
+ - `openinfer-deepseek-v4/src/direct/scheduler.rs` for request validation, the single-request greedy scheduler loop, token event emission, and sampling.
+ - `openinfer-deepseek-v4/src/direct/worker.rs` for rank worker commands, rank resource ownership, cache/RoPE management, per-rank prefill/decode execution, and rank-0 logits collection.
- Renamed the request/scheduler thread from `deepseek-v4-direct` to `deepseek-v4-scheduler`. Rank worker thread names remain `deepseek-v4-rank-{rank}`.
- Kept behavior unchanged; this is only a responsibility-boundary cleanup.
- Follow-up naming debt:
- The module and public type names still use `direct`; that name is legacy and should eventually become `engine` or `executor` in a dedicated rename pass.
- Verification:
- `cargo fmt` passed
- - `cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4` passed
+ - `cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4` passed
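The scheduler/worker boundary from Step 9 can be pictured with a small standard-library sketch: one driver function plays the scheduler role and sends commands to named rank worker threads over channels. Only the thread name pattern `deepseek-v4-rank-{rank}` comes from the log above; `RankCommand`, `spawn_rank_workers`, and `run_one_step` are hypothetical names, and the real workers own GPU resources that this sketch omits.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical command type; the real worker commands are not shown here.
enum RankCommand {
    Step(u32),
    Shutdown,
}

/// Spawn one named worker thread per rank. Each worker answers a `Step`
/// with its own thread name and the token it was given.
fn spawn_rank_workers(
    world: usize,
    replies: mpsc::Sender<(String, u32)>,
) -> Vec<(mpsc::Sender<RankCommand>, thread::JoinHandle<()>)> {
    (0..world)
        .map(|rank| {
            let (tx, rx) = mpsc::channel::<RankCommand>();
            let replies = replies.clone();
            let handle = thread::Builder::new()
                .name(format!("deepseek-v4-rank-{}", rank))
                .spawn(move || {
                    for cmd in rx {
                        match cmd {
                            RankCommand::Step(token) => {
                                let name =
                                    thread::current().name().unwrap_or("unnamed").to_string();
                                if replies.send((name, token)).is_err() {
                                    break;
                                }
                            }
                            RankCommand::Shutdown => break,
                        }
                    }
                })
                .expect("failed to spawn rank worker");
            (tx, handle)
        })
        .collect()
}

/// Scheduler-side driver: broadcast one step, collect one reply per rank,
/// then shut the workers down. Replies are sorted for a stable order.
fn run_one_step(world: usize, token: u32) -> Vec<(String, u32)> {
    let (reply_tx, reply_rx) = mpsc::channel();
    let workers = spawn_rank_workers(world, reply_tx);
    for (tx, _) in &workers {
        tx.send(RankCommand::Step(token)).expect("worker exited early");
    }
    let mut replies: Vec<(String, u32)> = (0..world)
        .map(|_| reply_rx.recv().expect("missing worker reply"))
        .collect();
    for (tx, handle) in workers {
        let _ = tx.send(RankCommand::Shutdown);
        handle.join().expect("worker panicked");
    }
    replies.sort();
    replies
}
```

Calling `run_one_step(2, 7)` returns one reply per rank, tagged with the worker's thread name. Keeping the command loop in the worker and the broadcast/collect logic in the driver mirrors the responsibility split between `scheduler.rs` and `worker.rs`, without claiming to reproduce either file.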
### Step 3: Expand scope to decode backend replacement
- User goal changed from adding standalone AG/RS collectives to completing the MoE all-to-all backend replacement and passing DeepSeek V4 E2E.
@@ -146,7 +146,7 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features de
- Exact E2E command run:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
```
- Result: `19 / 20` exact cases passed.
@@ -155,7 +155,7 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features de
- Performance sanity command run:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- --model-path $MODEL_DIR --format json request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- --model-path $MODEL_DIR --format json request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
- Result:
@@ -180,9 +180,9 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_servin
- Validation commands:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- --model-path $MODEL_DIR --format json request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
+OPENINFER_NVCC_JOBS=8 cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- --model-path $MODEL_DIR
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- --model-path $MODEL_DIR --format json request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
- Results:
@@ -192,22 +192,22 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_servin
- Earlier row-routed scalar path measured around `223.30ms`, so grouped TileLang cuts the local expert bottleneck roughly in half.
### Step 1: Add AG/RS collectives
-- Added `all_gather_bf16_hidden_group` in `pegainfer-deepseek-v4/src/runtime/collectives.rs`.
+- Added `all_gather_bf16_hidden_group` in `openinfer-deepseek-v4/src/runtime/collectives.rs`.
- Contract: every rank contributes `bf16 [B_local,H]`.
- Output on each rank: `bf16 [world*B_local,H]`.
- Uses NCCL `Comm::all_gather` on device buffers; no runtime D2H metadata.
-- Added `reduce_scatter_f32_hidden_group` in `pegainfer-deepseek-v4/src/runtime/collectives.rs`.
+- Added `reduce_scatter_f32_hidden_group` in `openinfer-deepseek-v4/src/runtime/collectives.rs`.
- Contract: every rank contributes `f32 [world*B_local,H]`.
- Output on each rank: `f32 [B_local,H]`.
- Uses NCCL `Comm::reduce_scatter(..., ReduceOp::Sum)` on device buffers.
-- Exported both helpers from `pegainfer-deepseek-v4/src/lib.rs`.
+- Exported both helpers from `openinfer-deepseek-v4/src/lib.rs`.
### Step 2: Validate
-- Added `nccl_hidden_all_gather_and_reduce_scatter_pair` in `pegainfer-deepseek-v4/tests/mp8_manifest.rs`.
+- Added `nccl_hidden_all_gather_and_reduce_scatter_pair` in `openinfer-deepseek-v4/tests/mp8_manifest.rs`.
- Ran:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest nccl_hidden_all_gather_and_reduce_scatter_pair -- --nocapture
+OPENINFER_NVCC_JOBS=8 cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest nccl_hidden_all_gather_and_reduce_scatter_pair -- --nocapture
```
- Result: passed, `1 passed; 0 failed; 23 filtered out`.
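The shape contracts that the pair test validates (`[B_local,H] -> [world*B_local,H]` for all-gather and `[world*B_local,H] -> [B_local,H]` for sum reduce-scatter) can be written down as a CPU-only reference. This is a sketch under assumptions: it uses `f32` for both operations because the standard library has no bf16 type, it models ranks as a slice of host buffers rather than NCCL communicators, and the names `all_gather_rows` and `reduce_scatter_sum_rows` are hypothetical.

```rust
/// CPU reference for all-gather: every rank contributes a row-major
/// `[b_local, h]` buffer and every rank receives the rank-ordered
/// concatenation `[world * b_local, h]`.
fn all_gather_rows(per_rank: &[Vec<f32>], b_local: usize, h: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(per_rank.len() * b_local * h);
    for (rank, buf) in per_rank.iter().enumerate() {
        assert_eq!(buf.len(), b_local * h, "rank {}: expected [B_local,H]", rank);
        out.extend_from_slice(buf);
    }
    out
}

/// CPU reference for sum reduce-scatter: every rank contributes
/// `[world * b_local, h]`; rank `r` receives the element-wise sum over all
/// ranks of rows `r * b_local .. (r + 1) * b_local`, i.e. `[b_local, h]`.
fn reduce_scatter_sum_rows(per_rank: &[Vec<f32>], b_local: usize, h: usize) -> Vec<Vec<f32>> {
    let world = per_rank.len();
    for (rank, buf) in per_rank.iter().enumerate() {
        assert_eq!(
            buf.len(),
            world * b_local * h,
            "rank {}: expected [world*B_local,H]",
            rank
        );
    }
    let chunk_len = b_local * h;
    (0..world)
        .map(|r| {
            let start = r * chunk_len;
            let mut chunk = vec![0.0f32; chunk_len];
            for buf in per_rank {
                for (dst, src) in chunk.iter_mut().zip(&buf[start..start + chunk_len]) {
                    *dst += *src;
                }
            }
            chunk
        })
        .collect()
}
```

A reference like this is useful for checking expected values in a small two-rank case before comparing against device results; it says nothing about NCCL performance or buffer placement.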
diff --git a/docs/models/deepseek-v4/moe-tilelang-review.md b/docs/models/deepseek-v4/moe-tilelang-review.md
index eb324596..a86c5b22 100644
--- a/docs/models/deepseek-v4/moe-tilelang-review.md
+++ b/docs/models/deepseek-v4/moe-tilelang-review.md
@@ -20,13 +20,13 @@
- **Read**:
- `docs/index.md` - identified DeepSeek V4 support, DeepSeek kernel paths, and kernel technology reference as the relevant routing docs.
- `docs/models/deepseek-v4/support.md` - confirmed the current DeepSeek V4 path has native MP8 runtime, TileLang build-time kernels, and a handwritten CUDA MoE path; it also notes MoE route-index D2H synchronization as a higher-risk remaining target.
- - `docs/models/deepseek-v4/kernel-paths.md` - confirmed DeepSeek CUDA sources now live under `pegainfer-kernels/csrc/deepseek_v4/` and TileLang generators live under `pegainfer-kernels/tools/tilelang/deepseek_v4/`.
+ - `docs/models/deepseek-v4/kernel-paths.md` - confirmed DeepSeek CUDA sources now live under `openinfer-kernels/csrc/deepseek_v4/` and TileLang generators live under `openinfer-kernels/tools/tilelang/deepseek_v4/`.
- **Relevant history**:
- `docs/models/deepseek-v4/support.md` records that the current local TileLang generator emits quantized linear, sparse attention, and HC kernels, while `deepseek_moe.cu` owns routing, SwiGLU, and expert accumulation.
- `docs/models/deepseek-v4/kernel-paths.md` records that the DeepSeek kernel routing table was recently organized, so this review should compare against those paths instead of rediscovering ownership from scratch.
- **Plan**:
1. Inspect the official DeepSeek TileKernels `tile_kernels/moe` directory from `https://github.com/deepseek-ai/TileKernels/tree/main/tile_kernels/moe`, including file names, exported kernels, and expected tensor layouts.
- 2. Inspect local MoE code paths: `pegainfer-kernels/csrc/deepseek_v4/deepseek_moe.cu`, related FFI declarations, and `pegainfer-deepseek-v4/src/runtime/` callers.
+ 2. Inspect local MoE code paths: `openinfer-kernels/csrc/deepseek_v4/deepseek_moe.cu`, related FFI declarations, and `openinfer-deepseek-v4/src/runtime/` callers.
3. Compare official kernels against local behavior along routing layout, expert grouping, quantization format, activation, accumulation dtype, and dispatch/combine boundaries.
4. Summarize what official TileLang operators exist, what they appear to solve, and which local MoE issue they most likely explain or do not explain.
5. If the gap is clear and small, propose the first implementation slice; otherwise stop with a focused diagnostic checklist.
@@ -68,13 +68,13 @@
Result: official MoE TileLang is primarily a routing, mapping, packing, and reduction toolkit. It does not appear to provide a single fused FP4 expert MLP kernel in `tile_kernels/moe/`; the fused expert GEMM path is implied by the expert-major layout and the quantized helpers used around it.
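To make the term "expert-major layout" concrete, the sketch below builds a packing plan with a generic counting sort: count rows per expert, take an exclusive prefix sum to get each expert's packed range, then assign every (token, top-k slot) pair a row inside its expert's range. This is a host-side illustration of the general technique only. `PackPlan` and `plan_expert_major` are hypothetical names, and nothing here is claimed to match the actual TileKernels kernels or their tensor layouts.

```rust
/// Expert-major packing plan for top-k routing (generic counting-sort sketch).
struct PackPlan {
    /// `offsets[e]..offsets[e + 1]` is the packed row range owned by expert `e`.
    offsets: Vec<usize>,
    /// `pos[t * top_k + k]` is the packed row for token `t`, top-k slot `k`.
    pos: Vec<usize>,
}

/// `topk_idx` is a row-major `[num_tokens, top_k]` array of expert indices.
fn plan_expert_major(topk_idx: &[usize], top_k: usize, num_experts: usize) -> PackPlan {
    assert!(top_k > 0 && topk_idx.len() % top_k == 0, "expected [T, top_k]");

    // Pass 1: count how many rows each expert receives.
    let mut offsets = vec![0usize; num_experts + 1];
    for &e in topk_idx {
        assert!(e < num_experts, "expert index out of range");
        offsets[e + 1] += 1;
    }

    // Prefix sum: offsets[e] becomes the first packed row of expert e.
    for e in 0..num_experts {
        offsets[e + 1] += offsets[e];
    }

    // Pass 2: hand out rows inside each expert's range in token order.
    let mut cursor = offsets[..num_experts].to_vec();
    let pos = topk_idx
        .iter()
        .map(|&e| {
            let p = cursor[e];
            cursor[e] += 1;
            p
        })
        .collect();

    PackPlan { offsets, pos }
}
```

With such a plan, inputs can be expanded once into packed rows, each expert can run over a contiguous range, and the `pos` map can drive the reduction back to token order. The `pos` array is only this sketch's analogue of a token/top-k to position map.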
### Step 2: Compare local MoE implementation
-- Local score routing in `pegainfer-kernels/csrc/deepseek_v4/deepseek_moe.cu` broadly matches the model config's simple scoring semantics:
+- Local score routing in `openinfer-kernels/csrc/deepseek_v4/deepseek_moe.cu` broadly matches the model config's simple scoring semantics:
- BF16 gate scores are converted to F32 and multiplied through cuBLAS;
- selection score is `sqrt(softplus(dot)) + gate_bias`;
- route weight is the original `sqrt(softplus(dot))`;
- selected weights are normalized and multiplied by `routed_scaling_factor`.
- Local execution differs substantially from the official fused-layout path:
- - `pegainfer-deepseek-v4/src/runtime/moe.rs` copies `routed.indices` from device to host and synchronizes in both `routed_local_experts_forward_bf16_hidden` and `routed_local_experts_forward_f32_hidden`.
+ - `openinfer-deepseek-v4/src/runtime/moe.rs` copies `routed.indices` from device to host and synchronizes in both `routed_local_experts_forward_bf16_hidden` and `routed_local_experts_forward_f32_hidden`.
- The CPU then builds `active_local` and loops over local experts.
- For each active local expert, `local_expert_forward_*` runs W1, W3, SwiGLU, and W2 over the full input batch, then masks/weights the result by route index.
- Official TileKernels instead keeps routing metadata on GPU, creates expert-major packed token ranges, expands inputs once, executes expert work over packed ranges, then reduces back with `token_topk_to_pos`.
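The score-routing semantics listed under "Local score routing" can be restated as a small CPU reference for a single token. This is a sketch, not the code in `deepseek_moe.cu`: `softplus` and `route_token` are hypothetical helpers, the tie-breaking rule (lower expert index first) is an assumption, and the gate dot products are taken as an input rather than computed through cuBLAS.

```rust
/// Softplus `ln(1 + e^x)`, with a large-input shortcut to avoid overflow.
fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Route one token given its per-expert gate dot products.
/// Returns `(expert index, route weight)` for the `top_k` selected experts.
fn route_token(
    dots: &[f32],
    gate_bias: &[f32],
    top_k: usize,
    routed_scaling_factor: f32,
) -> Vec<(usize, f32)> {
    assert_eq!(dots.len(), gate_bias.len());

    // Route weight is the unbiased sqrt(softplus(dot)).
    let weights: Vec<f32> = dots.iter().map(|&d| softplus(d).sqrt()).collect();

    // Selection uses the biased score; ties prefer the lower index (assumption).
    let mut order: Vec<usize> = (0..dots.len()).collect();
    order.sort_by(|&a, &b| {
        let sa = weights[a] + gate_bias[a];
        let sb = weights[b] + gate_bias[b];
        sb.total_cmp(&sa).then(a.cmp(&b))
    });
    order.truncate(top_k);

    // Normalize over the selected experts, then apply the scaling factor.
    // This sketch does not guard against a zero sum.
    let sum: f32 = order.iter().map(|&e| weights[e]).sum();
    order
        .into_iter()
        .map(|e| (e, weights[e] / sum * routed_scaling_factor))
        .collect()
}
```

The point of the example is that `gate_bias` affects which experts are selected but not the weights they receive: with equal dot products and different biases, the higher-bias experts win selection and still end up with equal normalized weights.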
@@ -129,7 +129,7 @@ Result: the most likely MoE problem is not the `sqrtsoftplus` math itself for no
- Non-nsys synthetic decode-heavy command:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- \
--model-path $MODEL_DIR --format json \
request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
@@ -221,13 +221,13 @@ nsys profile --stats=false --force-overwrite=true \
- Validation:
```bash
-cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4
+cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4
```
- Passed with the existing unreachable-pub warnings in `runtime/core.rs` and `runtime/state.rs`.
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- \
--model-path $MODEL_DIR --format json \
request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
@@ -241,7 +241,7 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_servin
- e2e: `3.69s`
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
--model-path $MODEL_DIR \
--ground-truth test_data/deepseek-v4-ground-truth.json \
--offset 0 --limit 1 --max-new-tokens 64
@@ -251,7 +251,7 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features de
- Full exact validation required before commit:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
--model-path $MODEL_DIR \
--ground-truth test_data/deepseek-v4-ground-truth.json \
--max-new-tokens 64
@@ -300,13 +300,13 @@ nsys profile --stats=false --force-overwrite=true \
- Validation:
```bash
-cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4
+cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4
```
- Passed with the existing unreachable-pub warnings in `runtime/core.rs` and `runtime/state.rs`.
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- \
--model-path $MODEL_DIR --format json \
request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
@@ -320,7 +320,7 @@ PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_servin
- e2e: `3.11s`
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
--model-path $MODEL_DIR \
--ground-truth test_data/deepseek-v4-ground-truth.json \
--max-new-tokens 64
@@ -406,13 +406,13 @@ nsys profile --stats=false --force-overwrite=true \
- Interpretation: MoE EP8 route imbalance creates some phase skew, but the 100ms-scale TPOT comes from many small phase skews being paid at every barrier. Attention and indexer collectives also pay large skew despite nearly equal active GPU work, which points at CPU runtime, allocation/free, launch gaps, and host-controlled loops as the amplification mechanism.
- Next experiments should measure or reduce runtime churn and host-controlled MoE decode scheduling before assuming expert compute or raw NCCL bandwidth is the limiting factor.
- Temporary NVTX proof trace:
- - Added a temporary profiling-only NVTX loader using runtime `dlopen`/`dlsym` for `nvtxRangePushA`, `nvtxRangePop`, and `nvtxMarkA`, gated by `PEGAINFER_DSV4_NVTX=1`. The instrumentation marked rank worker decode stages (`attn_local`, `indexer_ar`, `attention_ar`, `moe_route`, `moe_plan`, `moe_experts`, `moe_reduce`, `shared_expert`, `moe_ar`) plus active local expert counts and per-local-expert ranges. The temporary code was removed after the trace, so it is not part of the hot path.
+ - Added a temporary profiling-only NVTX loader using runtime `dlopen`/`dlsym` for `nvtxRangePushA`, `nvtxRangePop`, and `nvtxMarkA`, gated by `OPENINFER_DSV4_NVTX=1`. The instrumentation marked rank worker decode stages (`attn_local`, `indexer_ar`, `attention_ar`, `moe_route`, `moe_plan`, `moe_experts`, `moe_reduce`, `shared_expert`, `moe_ar`) plus active local expert counts and per-local-expert ranges. The temporary code was removed after the trace, so it is not part of the hot path.
- Build and trace commands:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo build --release -p pegainfer-server --bin bench_serving --features deepseek-v4
+OPENINFER_NVCC_JOBS=8 cargo build --release -p openinfer-server --bin bench_serving --features deepseek-v4
-PEGAINFER_DSV4_NVTX=1 nsys profile --stats=false --force-overwrite=true \
+OPENINFER_DSV4_NVTX=1 nsys profile --stats=false --force-overwrite=true \
--trace=cuda,nvtx,osrt --cuda-graph-trace=node \
--delay=34 --duration=12 \
-o target/profiling/dsv4_rank_stage_proof \
@@ -460,13 +460,13 @@ nsys export --type sqlite --force-overwrite=true \
- Validation:
```bash
-cargo check --release -p pegainfer-deepseek-v4 --features deepseek-v4
+cargo check --release -p openinfer-deepseek-v4 --features deepseek-v4
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- \
--model-path $MODEL_DIR --format json \
request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-deepseek-v4 --features deepseek-v4 --bin deepseek_v4_e2e -- \
--model-path $MODEL_DIR \
--ground-truth test_data/deepseek-v4-ground-truth.json \
--max-new-tokens 64
diff --git a/docs/models/deepseek-v4/online-throughput.md b/docs/models/deepseek-v4/online-throughput.md
index 9e7f712a..0ab3b8fa 100644
--- a/docs/models/deepseek-v4/online-throughput.md
+++ b/docs/models/deepseek-v4/online-throughput.md
@@ -136,12 +136,12 @@ Nsight Systems 10k direct, sorted by CUDA kernel total time:
| Decode MoE | `dispatch_decode_moe_step` accepts `input.seq_len`; local routing and grouped GEMM operate over seq length. | not first candidate until active-set path proves MoE is a top online bucket. |
| Prefill request batching | prefill starts one request into one KV slot; no multi-request prefill active set. | input throughput is mostly long-seq single-request kernel efficiency today; true batch prefill needs a larger scheduler/runtime shape change. |
| Prefill attention/compressor | prefill kernels are sequence-parallel for one request; no native multi-request DSV4 prefill stack. | Pacer prefill replacements should target high-share single-request buckets first, especially non-overlap compressor, while preserving the chosen quality policy. |
-| CUDA Graph | `pegainfer-server` passes `enable_cuda_graph=false` for DeepSeek V4; direct engine warns that DSV4 does not use CUDA Graph yet. | graph work starts after active-set shapes stabilize; blockers are dynamic seq/compressed lengths, collectives, stream/event ownership, allocator/scratch lifetimes, and batch capacity. |
+| CUDA Graph | `openinfer-server` passes `enable_cuda_graph=false` for DeepSeek V4; direct engine warns that DSV4 does not use CUDA Graph yet. | graph work starts after active-set shapes stabilize; blockers are dynamic seq/compressed lengths, collectives, stream/event ownership, allocator/scratch lifetimes, and batch capacity. |
## Next Work Selection
| Task | Owner | Entry |
| --- | --- | --- |
-| task #45 HTTP active-set batching + CUDA Graph serving gate | @PegaInfer-Dev | Make serving trace show active set size > 1 under c2/c4/c8, then measure output tok/s/TPOT against this baseline. |
+| task #45 HTTP active-set batching + CUDA Graph serving gate | @OpenInfer-Dev | Make serving trace show active set size > 1 under c2/c4/c8, then measure output tok/s/TPOT against this baseline. |
| task #46 decode operator replacement | @Pacer | Prefer decode compressor `_batch_` path from task #44 coverage; fallback is decode indexer top-k batch if compressor exactness blocks. |
| task #46 prefill operator replacement | @Pacer | Prefer non-overlap compressor only when local microbench/correctness and precision review show meaningful input-throughput gain; skip low-yield patches. |
diff --git a/docs/models/deepseek-v4/pplx-ep-integration.md b/docs/models/deepseek-v4/pplx-ep-integration.md
index 13b66e2f..955ad10f 100644
--- a/docs/models/deepseek-v4/pplx-ep-integration.md
+++ b/docs/models/deepseek-v4/pplx-ep-integration.md
@@ -14,15 +14,15 @@
- **PPLX worker/protocol evidence**: the worker-wait NVTX profile breaks the per-layer `p2p_all_to_all` p50 down to **1.609 ms**; multiplied by 43 layers, that accounts for the 74 ms class seen on non-rank0 ranks. Within it, `worker_wait_combine_recv_done` p50 is **1.111 ms/layer**, while `dispatch` p50 is only **0.010 ms/layer**. Local experiments such as per-token source sync, worker-derived active-source mask, early `tx_ready`, and route processing overlap all either failed or merely moved the wait elsewhere.
- **direct routed is mechanism evidence only**: the single-node peer-memory direct routed path bypasses the four legacy PPLX stages. H200 `output_len=64` p50 dropped from **144.00 ms** to **83.94 ms**, and to **78.68 / 77.33 ms** after rows512; the clean profile is PPLX **79.08 ms** vs NCCL **63.17 ms**. This path is a hack that bypasses the upstream four-stage semantics; the current code is back on legacy four-stage, and the `a2a_direct_*` API/kernel, direct worker mode, debug counters, and highly invasive RDMA/fabric probes have been removed.
- **Current key fix**: on 2026-05-17, `/proc` sampling confirmed that the CPU0 fabric worker was being preempted: the old CPU0 `tx_engine_domain` got only **3.60s** of CPU in a **7.0s** decode window, with **2980** nonvoluntary switches. After moving the rank0 TE worker from CPU0 to CPU10 in the same topology group, two H200 `output_len=64` reruns dropped to steady p50 **66.46 / 66.70 ms** and p95 **69.80 / 69.62 ms**, close to the NCCL **63 ms** class.
-- **Current code state**: the legacy four-stage kernel is back on cooperative multi-block launch; the correctness fix that publishes the done flag last is kept. Temporary NVTX ranges on the dsv4 side have been removed, while pplx-garden's own NVTX ranges are kept. CPU placement moved to `pegainfer_core::cpu_topology`: it reads the CUDA device NUMA node, the current affinity mask, and the NUMA cpulist, first splits contiguous CPU slices evenly among the ranks on the same NUMA node, then assigns rank/a2a/TE/UVM from within each slice. CPU0 is reserved and left unused, and CPU1 goes to the scheduler; at startup each rank prints one `cpu_slice/rank_worker/TE/a2a/UVM` line. The direct routed hack, the temporary profiler API capture, and the highly invasive diagnostics have all been removed.
-- **Validation state**: locally, `cargo test --release -p pegainfer-core cpu_topology -- --nocapture`, `cargo fmt --check -p pegainfer-core -p pegainfer-comm-fabric-lib -p pegainfer-deepseek-v4`, `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep-bench --bin deepseek_pplx_a2a_bench`, and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` all pass. The H200 release build passes; the `output_len=2` smoke run exits with status 0; `output_len=64` generates 64/64 tokens and then exits with teardown status 139, after metrics are printed: steady p50 **66.65 ms**, p95 **68.15 ms**, max **69.47 ms**. PPLX exact ground truth is 19/20; case 13 outputs `2500` instead of `2500 meters`, and NCCL fails the same case in the same way, so this is not attributed to the PPLX placement/communication changes.
+- **Current code state**: the legacy four-stage kernel is back on cooperative multi-block launch; the correctness fix that publishes the done flag last is kept. Temporary NVTX ranges on the dsv4 side have been removed, while pplx-garden's own NVTX ranges are kept. CPU placement moved to `openinfer_core::cpu_topology`: it reads the CUDA device NUMA node, the current affinity mask, and the NUMA cpulist, first splits contiguous CPU slices evenly among the ranks on the same NUMA node, then assigns rank/a2a/TE/UVM from within each slice. CPU0 is reserved and left unused, and CPU1 goes to the scheduler; at startup each rank prints one `cpu_slice/rank_worker/TE/a2a/UVM` line. The direct routed hack, the temporary profiler API capture, and the highly invasive diagnostics have all been removed.
+- **Validation state**: locally, `cargo test --release -p openinfer-core cpu_topology -- --nocapture`, `cargo fmt --check -p openinfer-core -p openinfer-comm-fabric-lib -p openinfer-deepseek-v4`, `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-deepseek-v4 --features pplx-ep-bench --bin deepseek_pplx_a2a_bench`, and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` all pass. The H200 release build passes; the `output_len=2` smoke run exits with status 0; `output_len=64` generates 64/64 tokens and then exits with teardown status 139, after metrics are printed: steady p50 **66.65 ms**, p95 **68.15 ms**, max **69.47 ms**. PPLX exact ground truth is 19/20; case 13 outputs `2500` instead of `2500 meters`, and NCCL fails the same case in the same way, so this is not attributed to the PPLX placement/communication changes.
- **Next**: run a low-invasiveness profile on the current per-NUMA slice placement, and re-check the remaining **~3-4 ms** gap of legacy PPLX relative to NCCL as well as the rank0 drain structure.
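The per-NUMA slice placement described in the code-state bullet can be sketched in Rust as follows. This is only an illustration: the function name `numa_rank_slices`, the "split first, then drop reserved CPUs" order, and the even-numbered NUMA0 cpulist in the usage note are assumptions, not the actual `cpu_topology` API.

```rust
/// Hypothetical helper (not the real `cpu_topology` API): split one NUMA
/// node's CPU list into contiguous, near-equal slices, one per rank, then
/// drop reserved CPUs (for example CPU0 and the scheduler CPU) from each slice.
fn numa_rank_slices(cpulist: &[usize], ranks: usize, reserved: &[usize]) -> Vec<Vec<usize>> {
    assert!(ranks > 0, "at least one rank per NUMA node");
    let base = cpulist.len() / ranks;
    let extra = cpulist.len() % ranks; // the first `extra` slices get one more CPU
    let mut slices = Vec::with_capacity(ranks);
    let mut start = 0;
    for rank in 0..ranks {
        let len = base + usize::from(rank < extra);
        // Keep the slice contiguous in cpulist order, minus reserved CPUs.
        let slice: Vec<usize> = cpulist[start..start + len]
            .iter()
            .copied()
            .filter(|cpu| !reserved.contains(cpu))
            .collect();
        slices.push(slice);
        start += len;
    }
    slices
}
```

With an assumed NUMA0 cpulist of the even CPUs 0 through 190, four ranks, and CPU0/CPU1 reserved, the slices start at CPU 2, 48, 96, and 144, which has the same shape as the startup placement reported in the per-NUMA validation section of this document.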
## TL;DR
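-**2026-05-17 cleanup note**: the 77-84 ms numbers for the direct routed single-node path are kept only as mechanism evidence; the implementation is back on legacy four-stage PPLX. The latest cleanup moves CPU placement into the shared `pegainfer_core::cpu_topology` module, gives each rank a contiguous slice according to the NUMA cpulist, leaves CPU0 unused, gives CPU1 to the scheduler, and takes the other workers from the rank's own slice. The bench's pplx bootstrap entry point became a hidden wrapper so that direct-internal placement types are not exposed; temporary dsv4 NVTX ranges were removed and pplx's own NVTX is untouched. The local release check and the `cpu_topology` unit tests pass; the H200 `output_len=64` rerun gives steady p50 **66.65 ms** and p95 **68.15 ms**, and exit code **139** occurs in the known teardown stage after metrics.
+**2026-05-17 cleanup note**: the 77-84 ms numbers for the direct routed single-node path are kept only as mechanism evidence; the implementation is back on legacy four-stage PPLX. The latest cleanup moves CPU placement into the shared `openinfer_core::cpu_topology` module, gives each rank a contiguous slice according to the NUMA cpulist, leaves CPU0 unused, gives CPU1 to the scheduler, and takes the other workers from the rank's own slice. The bench's pplx bootstrap entry point became a hidden wrapper so that direct-internal placement types are not exposed; temporary dsv4 NVTX ranges were removed and pplx's own NVTX is untouched. The local release check and the `cpu_topology` unit tests pass; the H200 `output_len=64` rerun gives steady p50 **66.65 ms** and p95 **68.15 ms**, and exit code **139** occurs in the known teardown stage after metrics.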
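-Turn the NVLink + RDMA MoE all-to-all backend of `pegainfer-comm` (derived from pplx-garden) from a skeleton into a working implementation, giving dsv4-flash **decode MoE** an alternative communication path that is selected at runtime through a switch; decode CUDA Graph is disabled globally when pplx is used. The scope covers only the dispatch/combine of routed experts; prefill, shared expert, attention, and indexer are untouched. **No** trait/dyn abstraction is introduced: there is only one implementation, so the concrete type is used directly.
+Turn the NVLink + RDMA MoE all-to-all backend of `openinfer-comm` (derived from pplx-garden) from a skeleton into a working implementation, giving dsv4-flash **decode MoE** an alternative communication path that is selected at runtime through a switch; decode CUDA Graph is disabled globally when pplx is used. The scope covers only the dispatch/combine of routed experts; prefill, shared expert, attention, and indexer are untouched. **No** trait/dyn abstraction is introduced: there is only one implementation, so the concrete type is used directly.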
## Working Scenario
@@ -32,7 +32,7 @@
## Current State (facts confirmed by reading the code)
-### pegainfer-comm public surface (skeleton)
+### openinfer-comm public surface (skeleton)
- `EpAllToAll` trait: four `&self` methods, `dispatch / combine / poll / release`; object-safe, `Send + Sync`.
- `EpBackendBuilder::build()`: **returns Err in both feature modes**: `BackendUnavailable` when `hw-rdma` is off, `Unimplemented` when `hw-rdma` is on.
@@ -41,7 +41,7 @@
- `SendBuf / RecvBuf`: raw device pointer + elem_count + elem_size + optional scale pointer; the caller owns the underlying allocation.
- `RdmaBackend` (`src/backend/rdma.rs`): private type; all four trait methods are `todo!()`, and the constructor currently only stores `EpTopology` and does not take an `AllToAllContext`.
-### pplx wrapper (`crates/pegainfer-comm-p2p-all-to-all/`)
+### pplx wrapper (`crates/openinfer-comm-p2p-all-to-all/`)
- `AllToAllContext::new(...)`: 21 parameters; the caller must pass in a `TransferEngine`, `rank_handles`, pre-registered send/recv buffers + MR, and host pointer arrays (sync/send/recv). Construction starts a `"p2p_all_to_all Worker"` background thread with fixed CPU affinity.
- The call shape is **four steps** (not the two steps the trait currently declares):
@@ -53,7 +53,7 @@
### Current dsv4-flash MoE communication path
-- decode: `decode_moe_ag_rs_bf16_hidden_with_scratch` at `pegainfer-deepseek-v4/src/runtime/moe.rs:1323`
+- decode: `decode_moe_ag_rs_bf16_hidden_with_scratch` at `openinfer-deepseek-v4/src/runtime/moe.rs:1323`
- NCCL `all_gather_bf16_hidden_into` (assembles the global hidden)
- local routing + grouped FP4 GEMM (local experts)
- NCCL `reduce_scatter_f32_hidden_into` (reduces back to local tokens)
@@ -61,7 +61,7 @@
- prefill: `moe_rank_lane_bf16_hidden` at `moe.rs:1289`
- route → expand → grouped GEMM → reduce → `all_reduce_f32_hidden_in_place`
- This is also all-reduce shaped, not A2A.
-- Communication abstraction layer: `pegainfer-deepseek-v4/src/runtime/collectives.rs` wraps a set of NCCL `Comm`-based helpers; all MoE communication goes through it.
+- Communication abstraction layer: `openinfer-deepseek-v4/src/runtime/collectives.rs` wraps a set of NCCL `Comm`-based helpers; all MoE communication goes through it.
- There is no dispatch/combine-shaped interface at all; it **needs to be added**, not swapped in.
### Non-goals
@@ -81,14 +81,14 @@ dsv4-flash MoE (rank lane)
│
├── via NCCL AG/RS (existing): CUDA Graph friendly
│
-     └── via pegainfer-comm (new): eager only, graph disabled
+     └── via openinfer-comm (new): eager only, graph disabled
│
└── EpBackend → AllToAllContext → a2a-kernels + fabric-lib
```
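- Switching granularity: **consistent across the whole process**, decided by CLI/Config and fixed after startup; a single layer never spans backends.
-### pegainfer-comm surface simplification
+### openinfer-comm surface simplification
Delete the skeleton's `EpAllToAll` trait + `Box`. `EpBackend` becomes a concrete struct whose inherent methods expose the four steps directly: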
@@ -107,7 +107,7 @@ impl EpBackend {
### dsv4 integration entry point
-Add `pegainfer-deepseek-v4/src/runtime/moe_pplx.rs` (flat layout, no `mod.rs`):
+Add `openinfer-deepseek-v4/src/runtime/moe_pplx.rs` (flat layout, no `mod.rs`):
- `decode_moe_pplx_bf16_hidden_with_scratch(ctx, config, weights, ptr_cache, ep, layer, input, token_ids, scratches)`
- Rough order: `dispatch_send` → run the shared expert on the same stream → `dispatch_recv` → grouped FP4 GEMM → `combine_send` → run next-layer preparation on the same stream → `combine_recv` writes back hidden.
@@ -125,7 +125,7 @@ pplx 路径的 send/recv buffer 必须**预注册到 fabric-lib 的 MR**,不
### Initialization location
-In the `RankWorker::spawn` stage in `pegainfer-deepseek-v4/src/direct.rs`, at the same level as the NCCL `Comm`:
+In the `RankWorker::spawn` stage in `openinfer-deepseek-v4/src/direct.rs`, at the same level as the NCCL `Comm`:
```
RankWorker::spawn
@@ -142,7 +142,7 @@ RankWorker::spawn
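### Runtime switching
-- Compile time: the `hw-rdma` feature of `pegainfer-comm` already exists; dsv4 adds a `pplx-ep` feature, and when it is off the whole of `moe_pplx.rs` is `cfg`-ed out and the fabric-lib dependency is not pulled in.
+- Compile time: the `hw-rdma` feature of `openinfer-comm` already exists; dsv4 adds a `pplx-ep` feature, and when it is off the whole of `moe_pplx.rs` is `cfg`-ed out and the fabric-lib dependency is not pulled in.
- Runtime: `Config` gains `moe_backend: MoeBackend { Nccl, Pplx }`, with CLI `--moe-backend nccl|pplx` defaulting to `nccl`. When `pplx` is selected:
- The `pplx-ep` feature must be enabled, otherwise startup fails with an error.
- decode CUDA Graph is disabled automatically (the user does not need to pass a separate flag).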
@@ -153,7 +153,7 @@ RankWorker::spawn
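Confirm the scratch/buffer shape, the initialization location, and the CLI entry point.
-### Step 1: remove the skeleton from pegainfer-comm and drop the trait
+### Step 1: remove the skeleton from openinfer-comm and drop the trait
- Delete the `EpAllToAll` trait and `Box`; make `EpBackend` concrete with four inherent step methods.
- Change `EpBackendBuilder::build()`: the `hw-rdma` branch actually constructs an `AllToAllContext`.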
@@ -162,7 +162,7 @@ RankWorker::spawn
### Step 2: add the pplx path to dsv4
-- Write `decode_moe_pplx_bf16_hidden_with_scratch` in `pegainfer-deepseek-v4/src/runtime/moe_pplx.rs`; routing / grouped GEMM / shared expert reuse the existing helpers, and only AG/RS is replaced by dispatch_send→recv + combine_send→recv.
+- Write `decode_moe_pplx_bf16_hidden_with_scratch` in `openinfer-deepseek-v4/src/runtime/moe_pplx.rs`; routing / grouped GEMM / shared expert reuse the existing helpers, and only AG/RS is replaced by dispatch_send→recv + combine_send→recv.
- Add `MoePplxScratch` at the same level as `MoeAgRsScratch`; `RankWorker` holds one of the two, selected by `cfg(feature="pplx-ep")`.
- Add `if let Some(ep) = &self.ep { ... } else { existing NCCL path }` at the `RankWorker` MoE call site.
- Add `--moe-backend` to `Config` / CLI; force decode CUDA Graph off when pplx is selected.
@@ -201,14 +201,14 @@ RankWorker::spawn
## Current Progress (2026-05-16)
**Landed**
-- `pegainfer-comm` skeleton removed and trait dropped: `EpBackend` is a concrete struct, the four steps `dispatch_send / dispatch_recv / combine_send / combine_recv` are inherent methods, construction goes through `EpBackend::new(EpBackendParams)`, and `tokens_per_expert_ptr()` lets the downstream grouped GEMM obtain per-expert counts. `unsafe impl Send` allows an EpBackend to be handed over into RankWorker from an external thread.
+- `openinfer-comm` skeleton removed and trait dropped: `EpBackend` is a concrete struct, the four steps `dispatch_send / dispatch_recv / combine_send / combine_recv` are inherent methods, construction goes through `EpBackend::new(EpBackendParams)`, and `tokens_per_expert_ptr()` lets the downstream grouped GEMM obtain per-expert counts. `unsafe impl Send` allows an EpBackend to be handed over into RankWorker from an external thread.
- **LibTorch dependency removed**: a2a-kernels defines its own cxx `ScalarType` enum (namespace changed to `a2a_kernels::`), and the `torch-lib` dep is removed from both a2a-kernels and p2p-all-to-all. The `hw-rdma` feature now needs only CUDA + RDMA Verbs + GDRCopy and no longer pulls in LibTorch / pyo3.
-- dsv4 adds a `pplx-ep` feature with an optional dependency on `pegainfer-comm/hw-rdma`.
+- dsv4 adds a `pplx-ep` feature with an optional dependency on `openinfer-comm/hw-rdma`.
- `runtime/moe_pplx.rs`: the body of `decode_moe_pplx_bf16_hidden_with_scratch` is complete: local route → dispatch_send → shared expert (overlap) → dispatch_recv → host-side prefix sum to produce expert_indptr → grouped FP4 expert → combine_send → combine_recv (`accumulate=true`, folding the shared expert into the routed output).
- `state.rs` adds `MoePplxScratch` (the MR-recv buffer is still a placeholder and must be registered during bootstrap) + `MoeRunContext` / `MoePplxRunContext`, which unify the two MoE paths into a single parameter.
- The signature of `block_decode_rank_lane_bf16_hidden_with_scratch` (including the batch variant) changes to `moe: &mut MoeRunContext<'_>`; internally `dispatch_decode_moe_step` dispatches to one of the two paths based on `moe.pplx.is_some()`.
- `RankWorker` adds `RankCommand::EnablePplx { ep_backend }`; `DeepSeekV4DirectGenerator::enable_pplx(Vec)` hands each per-rank backend to its worker.
-- `cargo check -p pegainfer-comm` passes. dsv4 cannot be compiled locally because pegainfer-kernels is missing the CUDA/flashinfer SDK on this machine (pre-existing), so structural review relies on the diff.
+- `cargo check -p openinfer-comm` passes. dsv4 cannot be compiled locally because openinfer-kernels is missing the CUDA/flashinfer SDK on this machine (pre-existing), so structural review relies on the diff.
- Several hard blockers were fixed to get the full H200 decode path working:
- **per-rank TransferEngine**: each GPU binds its own CX-7 NIC, which is the only way `AllToAllRankHandle` can carry the peer's own NIC `main_address`. With the earlier shared TE, all RankHandles pointed at worker[0], triggering RDMA `LOC_PROT_ERR`.
- **`num_dp_groups = world_size / dp_size`** (= world_size under pure EP): it was previously hard-coded to 1, which made `num_routed[N*num_experts]` write out of bounds.
@@ -220,7 +220,7 @@ RankWorker::spawn
- `PplxRankResources.peer_mappings` takes over the lifetime of the peer CUMem `CUMemMapping`; no more `Box::leak`.
**Functional baseline (commit `0abe8fa`)**
-- Command: `PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 4 --warmup 0 --iters 1`
+- Command: `OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 4 --warmup 0 --iters 1`
- Result: prefill 521 ms, first decode step 7331 ms, steady TPOT **6900 ms / tok** (0.14 tok/s).
- NCCL reference (same H200 machine): steady TPOT **63.77 ms / tok** (15.69 tok/s).
- On exit, the a2a_context worker shutdown path segfaults; this does not affect the forward pass and will be cleaned up later.
@@ -229,15 +229,15 @@ RankWorker::spawn
- After the per-rank TransferEngine / NIC binding fix, the Verbs `LOCAL_PROTECTION_ERROR` disappeared, indicating that MR/lkey/NIC ownership is now broadly aligned.
- The host-visible done flag of `dispatch_recv` / `combine_recv` is now published last: first reset the state shared between kernel and worker, then `fence_release_system()`, and finally `st_mmio_b8(*_done, 1)`. This fixes a timing race in which the worker saw done and advanced to the next step while the tail of the previous kernel was still clearing flags.
- `dispatch_recv` additionally fixes the `num_fabric_tokens == 0` completion path that is common on a single node: only block 0 is allowed to publish completion; when there are fabric tokens, only blocks with `num_local_tokens > 0` participate in `grid_counter`. Otherwise an empty block could also satisfy the completion condition.
-- H200 short-run command: `PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 RUST_BACKTRACE=1 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 2 --warmup 0 --iters 1`
+- H200 short-run command: `OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 RUST_BACKTRACE=1 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 2 --warmup 0 --iters 1`
- H200 short-run result: completed, `prefill_ms=534.96`, `first_decode_step_ms=1487.85`, `e2e_ms=2023.10`, `decode_tok_s=0.67`; log `$RESULT_ROOT/pplx_after_flag_order.log`. This is a correctness signal and is not used as a new TPOT baseline.
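The publish-last ordering of the done flag can be illustrated with a host-side Rust analogy. This is only a sketch using `std::sync::atomic`; the real fix lives in the CUDA kernels with `fence_release_system()` and `st_mmio_b8`, and the `Shared` struct and its field layout here are hypothetical.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

/// Hypothetical stand-in for state shared between a kernel and its worker.
struct Shared {
    grid_counter: AtomicU32,
    done: AtomicBool,
}

/// Producer side: reset shared state first, fence, and publish `done` last.
fn finish_round(shared: &Shared) {
    shared.grid_counter.store(0, Ordering::Relaxed); // 1. reset for the next round
    fence(Ordering::Release); // 2. order the reset before the flag store
    shared.done.store(true, Ordering::Relaxed); // 3. publish completion last
}

/// Consumer side: once `done` is observed with acquire ordering,
/// the earlier reset is guaranteed to be visible.
fn try_consume(shared: &Shared) -> Option<u32> {
    if shared.done.load(Ordering::Acquire) {
        Some(shared.grid_counter.load(Ordering::Relaxed))
    } else {
        None
    }
}
```

If the flag were stored before the reset, a consumer could observe `done` and start the next round while the producer was still clearing state, which is the shape of the race described in the list above.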
**GPU expert-indptr update**
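- `deepseek_pplx_padded_expert_indptr_cuda` is added as a 1-block helper kernel: it reads the `recv_tokens_per_expert[local_experts]` written by `dispatch_recv` and generates the padded `expert_indptr[local_experts + 1]` according to the pplx `expert_padding`.
- `moe_pplx.rs` removes the per-layer `moe_stream.synchronize()`, the D2H copy of `recv_tokens_per_expert`, the CPU prefix sum, and the H2D copy of `expert_indptr`, replacing them with `dispatch_recv -> device prefix -> event -> grouped GEMM`.
- In the first version, to keep the host from reading the dynamic padded count, the host-side `rows` of the grouped FP4 GEMM uses `expanded_input.seq_capacity()`; the real expert range is still controlled by the device `expert_indptr`, and `combine_send` still reads the real token count from the pplx worker's device `num_recv_tokens`. This version prioritizes eliminating the synchronization round trip; dynamic rows can be kept on the GPU side later.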
-- Local validation: `rustfmt --edition 2024 --check pegainfer-kernels/src/ffi.rs pegainfer-deepseek-v4/src/runtime/moe_pplx.rs pegainfer-deepseek-v4/src/runtime/state.rs` passed; `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed.
-- H200 validation: `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed; `cargo build --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed.
+- Local validation: `rustfmt --edition 2024 --check openinfer-kernels/src/ffi.rs openinfer-deepseek-v4/src/runtime/moe_pplx.rs openinfer-deepseek-v4/src/runtime/state.rs` passed; `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed.
+- H200 validation: `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed; `cargo build --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed.
- H200 `output_len=2`: `prefill_ms=519.38`, `first_decode_ms=191.55`, `e2e_ms=711.36`, `decode_tok_s=5.22`; log `$RESULT_ROOT/pplx_gpu_indptr_olen2.log`.
- H200 `output_len=4`: `prefill_ms=488.55`, `first_decode_ms=199.41`, steady TPOT **123.91 ms/tok**, `e2e_ms=936.22`, `decode_tok_s=6.71`; log `$RESULT_ROOT/pplx_gpu_indptr_olen4.log`.
- Negative experiment: after GPU indptr, reading back only `expert_indptr[local_experts]` per MoE layer to pass exact dynamic `rows` into grouped FP4 regressed H200 `output_len=4` to `first_decode_ms=1451.94`, steady TPOT **1544.45 ms/tok**, `e2e_ms=5040.78`; log `$RESULT_ROOT/pplx_dynamic_rows_olen4.log`. The one-scalar host wait is still far more expensive than running grouped FP4 over scratch capacity, so this change was reverted. Dynamic rows only make sense through a GPU-only launch strategy or a custom kernel wrapper.
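For reference, the padded `expert_indptr` computation that the helper kernel performs on the device can be sketched on the host in Rust. The function name is hypothetical, and rounding each expert's count up to a multiple of `expert_padding` is an assumption about the padding rule rather than something confirmed by the kernel source.

```rust
/// Hypothetical host-side reference: build `expert_indptr[local_experts + 1]`
/// from per-expert token counts, assuming each count is rounded up to a
/// multiple of `expert_padding` before the exclusive prefix sum.
fn padded_expert_indptr(tokens_per_expert: &[u32], expert_padding: u32) -> Vec<u32> {
    assert!(expert_padding > 0, "padding must be positive");
    let mut indptr = Vec::with_capacity(tokens_per_expert.len() + 1);
    let mut offset = 0u32;
    indptr.push(offset);
    for &count in tokens_per_expert {
        // Round up to the next multiple of the padding; zero stays zero.
        offset += count.div_ceil(expert_padding) * expert_padding;
        indptr.push(offset);
    }
    indptr
}
```

Under that assumption, counts of `[3, 0, 17, 16]` with padding 16 give offsets `[0, 16, 16, 48, 64]`; the last entry is the dynamic padded row count that the negative experiment above tried to read back to the host.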
@@ -283,9 +283,9 @@ RankWorker::spawn
- Conclusion: the current "MoE jitter" looks more like variance in pre-MoE operators/launch/rank arrival being amplified in the pplx worker range display than something determined by the average time of the four pplx kernel stages alone.
- 2026-05-16 temporary HC mix bypass experiment (code not kept): for `seq_len=1` with no raw/rms side output, `deepseek_hc_mixes_cuda` was changed from BF16->F32 + cuBLAS `Sgemv` + scale kernel to the existing `deepseek_hc_mixes_kernel`. The H200 `output_len=8` short test gave steady TPOT **149.96 ms/tok** (p50 **127.93 ms**, max **232.07 ms**), and the nsys event profile gave steady TPOT **133.31 ms/tok**. The profile confirms that cuBLAS GEMV essentially disappeared and that `deepseek_hc_mixes_kernel` ran 66 times for a total of **1.77 ms**, but the overall time is still dominated by NCCL all-reduce, FP4 grouped GEMM, and CUDA API/launch long tails; this change is therefore not kept on the main line.
- 2026-05-16 added a one-off NVTX probe (compiled only with the `pplx-ep` feature): the request main thread marks `dsv4.request.prefill / step / sample / emit_token / advance_decode`; the runtime marks `dsv4.runtime.dispatch_rank_decode / wait_rank_decode / rank0_logits`; rank workers mark `dsv4.rank.decode / token_upload / embedding / embedding_all_reduce / hc_expand / decode_layer / final_logits / gather_logits / logits_all_gather / logits_dtoh`; within a layer it marks `dsv4.layer.attn_hc_pre_norm / attention / attention_full / attention_ratio4 / attention_compressed / ffn_hc_pre_norm / moe / ffn_hc_post`. The existing pplx worker ranges (`p2p_all_to_all / dispatch / combine / process_routing_info / barrier`) are kept. This lets a single nsys timeline directly determine whether the steady 60 ms gap is sampling/logits, rank response wait, pre-MoE operator arrival, or the pplx worker protocol.
-- Validation for the probe: local `cargo fmt --check -p pegainfer-deepseek-v4` passed; local `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed; H200 `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed after syncing the instrumented files and restoring local `deepseek_hc.cu` on the remote tree. The remaining warnings are the pre-existing pplx visibility/unused warnings.
-- 2026-05-16 ratio128 compressed decode moved to scratch: single-token non-overlap compressed attention now reuses `AttentionProjectionScratch / AttentionIndexScratch / AttentionAuxScratch / AttentionOutputScratch`, adds `compressor_nonoverlap_decode_bf16_hidden_at_scratch` and `compress_topk_indices_decode_into`, and removes the old owned-return single-token entry point. Local validation: `cargo fmt --check -p pegainfer-deepseek-v4` passed, `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed, `git diff --check` passed. H200 validation: `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed, `cargo build --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 `output_len=8` smoke completed with steady TPOT **167.95 ms/tok** (`$RESULT_ROOT/pplx_ratio128_scratch_olen8.log`), and node NVTX profile completed with steady TPOT **147.31 ms/tok** (`$RESULT_ROOT/pplx_ratio128_scratch_nvtx_olen8.{log,nsys-rep,sqlite}`). The new sqlite confirms decode-window `cuMemAllocAsync` and `cuMemFreeAsync` are both **0**; previous probe had `cuMemAllocAsync=11200` all attributed to `dsv4.layer.attention_compressed`. TPOT did not materially improve, so allocator spikes were real but not the current wall-clock root cause.
-- 2026-05-16 ratio4 topk refactor correction: the first implementation targeted the single-token ratio4 helper, but the H200 profile and the source path confirmed that decode goes through `attention_decode_compressed_overlap_rank_local_collective_bf16_hidden_batch_with_scratch`, i.e. even `bs=1` uses the batch helper. The single-token fused helper and the dead batch `indexer_topk_indices_decode_batch_into` wrapper were then deleted, and `window_topk + indexer_topk + concat` in the batch path were merged into `deepseek_ratio4_decode_topk_indices_batch_kernel`; `max_compressed_len == 0` still takes the window-only path. Local validation: `cargo fmt --check -p pegainfer-deepseek-v4 -p pegainfer-kernels` passed, `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep` passed, `git diff --check` passed. H200 validation: `cargo build --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; bench binary fatbin contains `_Z48deepseek_ratio4_decode_topk_indices_batch_kernel...`, and remote source calls `ratio4_decode_topk_indices_batch_into` from `runtime/block.rs`.
+- Validation for the probe: local `cargo fmt --check -p openinfer-deepseek-v4` passed; local `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed; H200 `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed after syncing the instrumented files and restoring local `deepseek_hc.cu` on the remote tree. The remaining warnings are the pre-existing pplx visibility/unused warnings.
+- 2026-05-16 ratio128 compressed decode moved to scratch: single-token non-overlap compressed attention now reuses `AttentionProjectionScratch / AttentionIndexScratch / AttentionAuxScratch / AttentionOutputScratch`, adds `compressor_nonoverlap_decode_bf16_hidden_at_scratch` and `compress_topk_indices_decode_into`, and removes the old owned-return single-token entry point. Local validation: `cargo fmt --check -p openinfer-deepseek-v4` passed, `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed, `git diff --check` passed. H200 validation: `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed, `cargo build --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 `output_len=8` smoke completed with steady TPOT **167.95 ms/tok** (`$RESULT_ROOT/pplx_ratio128_scratch_olen8.log`), and node NVTX profile completed with steady TPOT **147.31 ms/tok** (`$RESULT_ROOT/pplx_ratio128_scratch_nvtx_olen8.{log,nsys-rep,sqlite}`). The new sqlite confirms decode-window `cuMemAllocAsync` and `cuMemFreeAsync` are both **0**; previous probe had `cuMemAllocAsync=11200` all attributed to `dsv4.layer.attention_compressed`. TPOT did not materially improve, so allocator spikes were real but not the current wall-clock root cause.
+- 2026-05-16 ratio4 topk refactor correction: the first implementation targeted the single-token ratio4 helper, but the H200 profile and the source path confirmed that decode goes through `attention_decode_compressed_overlap_rank_local_collective_bf16_hidden_batch_with_scratch`, i.e. even `bs=1` uses the batch helper. The single-token fused helper and the dead batch `indexer_topk_indices_decode_batch_into` wrapper were then deleted, and `window_topk + indexer_topk + concat` in the batch path were merged into `deepseek_ratio4_decode_topk_indices_batch_kernel`; `max_compressed_len == 0` still takes the window-only path. Local validation: `cargo fmt --check -p openinfer-deepseek-v4 -p openinfer-kernels` passed, `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep` passed, `git diff --check` passed. H200 validation: `cargo build --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; bench binary fatbin contains `_Z48deepseek_ratio4_decode_topk_indices_batch_kernel...`, and remote source calls `ratio4_decode_topk_indices_batch_into` from `runtime/block.rs`.
- 2026-05-16 ratio4 batch topk profile:H200 `output_len=16` nsys report `$RESULT_ROOT/pplx_ratio4_batch_topk_nvtx_olen16.{log,nsys-rep,sqlite}` completed with benchmark steady TPOT avg **152.85 ms**, p50 **144.03 ms**, p95 **188.01 ms**, max **196.06 ms**. Compared with `$RESULT_ROOT/pplx_ratio4_refactor_nvtx_olen16.sqlite`, NVTX distributions improved in the specific attention range: `dsv4.layer.attention_ratio4` avg **2.178 -> 0.827 ms**, p95 **15.746 -> 1.593 ms**, max **42.748 -> 34.788 ms**; `dsv4.rank.decode` p50 **114.779 -> 79.695 ms**. Request-level `dsv4.runtime.wait_rank_decode` p50 only moved **175.804 -> 143.788 ms** and still dominates, so this change removes real ratio4 launch fanout but does not complete the TPOT target by itself. Nsight kernel table captured only one device, so kernel-name absence/presence in the sqlite is not a sufficient proof source; source path + fatbin symbol + NVTX movement are the useful evidence.
- 2026-05-17 decode-only driver-contention profile:temporary `cudaProfilerStart/Stop` was hard-coded around decode and nsys was run with `--capture-range=cudaProfilerApi --sample=process-tree --sampling-period=1000000 --cpuctxsw=process-tree --cudabacktrace=all:1000 --cuda-flush-interval=100 --osrt-threshold=1000 --stats=true`; remote report `$RESULT_ROOT/pplx_driver_contention_olen8.{log,nsys-rep,sqlite}`. That profiler API patch was removed during cleanup; this entry is a historical capture record, not the current reusable profiling command. The capture fixed the previous truncated-kernel profile: every device has **17787** kernels and D2H device memcpy time is only **93 us** total across 7 copies. CUDA API summary shows long host-side tails instead:
- `cudaLaunchCooperativeKernel`: 9632 calls, total **1287.95 ms**, max **168.34 ms**; stack is `cudaLaunchCooperativeKernel -> a2a_dispatch_send -> EpBackend::dispatch_send -> decode_moe_pplx...`.
@@ -295,10 +295,10 @@ RankWorker::spawn
- `cuKernelSetAttribute` has three **20.3-20.7 ms** tails inside `attention_ratio4 step=3 layer=2`, matching the driver-state hypothesis. Since the captured callchains for these rows are absent in Nsight, the current attribution is by NVTX containment and correlation timing rather than Rust stack.
- CPU sampling marks rank threads as `Running` during samples, so the rank workers are not simply sleeping in application code during the measured decode; the expensive sleeps in OSRT are mostly channel waits outside hot work or libcuda internal mutex/futex behavior during launch.
This profile changes the interpretation of ratio4/HC spikes: tune kernel arithmetic only after checking API-vs-GPU duration. The immediate optimization axis is launch count / module lookup / attribute setup churn in the decode path, while MoE/NCCL communication remains excluded from the current optimization target.
-- 2026-05-17 upstream `pegainfer-comm/benchmarks/bench_all_to_all.py` was run on H200 with dsv4-shaped payloads to get the pplx-side theoretical MoE A2A floor. The script now exposes `--expert-padding` so it can match Rust `PplxBootstrapParams::default().expert_padding = 16`; default upstream padding remains 1. Common command shape:
+- 2026-05-17 upstream `openinfer-comm/benchmarks/bench_all_to_all.py` was run on H200 with dsv4-shaped payloads to get the pplx-side theoretical MoE A2A floor. The script now exposes `--expert-padding` so it can match Rust `PplxBootstrapParams::default().expert_padding = 16`; default upstream padding remains 1. Common command shape:
```bash
-cd $PEGAINFER_DIR-comm
+cd $OPENINFER_DIR-comm
NCCL_NVLS_ENABLE=0 PYTHONPATH=. ../.venv/bin/python benchmarks/bench_all_to_all.py \
--world-size 8 --dp-size 1 --nets-per-gpu 1 \
--max-private-tokens 64 \
@@ -316,15 +316,15 @@ The two payload points:
| Real bs=1 decode: `max_num_tokens=1`, total EP routes = `8 * 1 * 6 = 48` | `$RESULT_ROOT/dsv4_pplx_a2a_max1_pad16.{log,json}` | **63.87 us** | **21.54 us** | **85.41 us/layer** | **52.64 us/layer** |
| Rust bootstrap capacity: `max_num_tokens=8`, total EP routes = `8 * 8 * 6 = 384` | `$RESULT_ROOT/dsv4_pplx_a2a_max8_pad16.{log,json}` | **63.71 us** | **21.79 us** | **85.50 us/layer** | **53.82 us/layer** |
-Interpretation: the upstream multi-process pplx benchmark reports the GPU/protocol A2A floor for our BF16/topk6/EP8 payload as roughly **0.085 ms per MoE layer**, or about **3.7 ms/token** across 43 layers if dispatch+combine are serialized. This is more than an order of magnitude below the 140-160 ms request TPOT class, so the current gap is not explained by raw payload movement. This benchmark does not include pegainfer's full model runtime, explicit stream handoffs, model-side operator launch fanout, grouped GEMM/attention/NCCL, or request wait-rank effects; use it as a theoretical lower bound, not as an end-to-end replacement profile.
+Interpretation: the upstream multi-process pplx benchmark reports the GPU/protocol A2A floor for our BF16/topk6/EP8 payload as roughly **0.085 ms per MoE layer**, or about **3.7 ms/token** across 43 layers if dispatch+combine are serialized. This is more than an order of magnitude below the 140-160 ms request TPOT class, so the current gap is not explained by raw payload movement. This benchmark does not include openinfer's full model runtime, explicit stream handoffs, model-side operator launch fanout, grouped GEMM/attention/NCCL, or request wait-rank effects; use it as a theoretical lower bound, not as an end-to-end replacement profile.
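The per-token floor quoted above is plain arithmetic over the table values, and a small Rust helper makes the derivation explicit. The function name is hypothetical, and treating the two per-layer stage columns of the bs=1 row as dispatch and combine is an assumption; only their sum matters here.

```rust
/// Hypothetical helper: serialized A2A floor per token in milliseconds,
/// i.e. (dispatch + combine) per MoE layer times the number of MoE layers.
fn a2a_floor_ms_per_token(dispatch_us: f64, combine_us: f64, moe_layers: u32) -> f64 {
    (dispatch_us + combine_us) * f64::from(moe_layers) / 1000.0
}
```

With the bs=1 row (63.87 us and 21.54 us per layer) and 43 layers, this evaluates to about 3.67 ms/token, so a 140 ms request TPOT is roughly 38 times the raw A2A floor.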
- 2026-05-17 added `deepseek_pplx_a2a_bench`, a Rust-side microbench that reuses the same dsv4 `build_intra_node_backends_for_devices` wrapper and `EpBackend` methods but excludes all model operators. It allocates BF16 hidden/out buffers on each rank, uses synthetic balanced routes to the next 6 ranks (`topk=6`), and reports both flattened rank×iteration stage times and per-iteration max across the 8 ranks. This isolates the single-process Rust wrapper / bootstrap / stream handoff layer between the upstream Python benchmark and the full dsv4 runtime.
Command shape:
```bash
-cd $PEGAINFER_DIR
+cd $OPENINFER_DIR
PATH=$CARGO_BIN_DIR:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/bin:$PATH \
- cargo build --release -p pegainfer-deepseek-v4 --features pplx-ep-bench \
+ cargo build --release -p openinfer-deepseek-v4 --features pplx-ep-bench \
--bin deepseek_pplx_a2a_bench
./target/release/deepseek_pplx_a2a_bench \
--model-path $MODEL_DIR \
@@ -346,13 +346,13 @@ Interpretation: the Rust single-process wrapper still keeps pplx A2A at roughly
### Per-NUMA Slice Placement Validation
-H200 validation after moving CPU topology helpers to `pegainfer_core::cpu_topology`:
+H200 validation after moving CPU topology helpers to `openinfer_core::cpu_topology`:
- Local build gates passed:
- - `cargo test --release -p pegainfer-core cpu_topology -- --nocapture`
- - `cargo fmt --check -p pegainfer-core -p pegainfer-comm-fabric-lib -p pegainfer-deepseek-v4`
- - `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep-bench --bin deepseek_pplx_a2a_bench`
- - `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`
+ - `cargo test --release -p openinfer-core cpu_topology -- --nocapture`
+ - `cargo fmt --check -p openinfer-core -p openinfer-comm-fabric-lib -p openinfer-deepseek-v4`
+ - `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-deepseek-v4 --features pplx-ep-bench --bin deepseek_pplx_a2a_bench`
+ - `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`
- H200 release builds passed for `bench_serving` and `deepseek_pplx_a2a_bench`.
- Startup placement now reserves CPU0 and CPU1, then slices each NUMA node by rank:
- NUMA0 ranks 0-3 use even CPU slices: rank0 `2..46`, rank1 `48..94`, rank2 `96..142`, rank3 `144..190`.
@@ -414,7 +414,7 @@ Profile file: `$RESULT_ROOT/pplx_moe_stage_nvtx_olen16.sqlite` on `jzh200-11`.
Command shape:
```bash
-PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
+OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
--force-overwrite=true --trace=nvtx --sample=none --stats=false \
-o $RESULT_ROOT/pplx_moe_stage_nvtx_olen16 \
./target/release/bench_serving --model-path $MODEL_DIR \
@@ -465,7 +465,7 @@ Profile file: `$RESULT_ROOT/pplx_moe_stage_cuda_capture_olen16.sqlite` on `jzh20
Historical command shape while the temporary profiler API patch was present:
```bash
-PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
+OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
--force-overwrite=true --trace=cuda,nvtx --sample=none \
--capture-range=cudaProfilerApi --capture-range-end=stop-shutdown \
--cuda-flush-interval=100 --stats=false \
@@ -534,7 +534,7 @@ Profile file: `$RESULT_ROOT/pplx_driver_contention_olen8.sqlite` on `jzh200-11`.
Historical command shape while the temporary profiler API patch was present:
```bash
-PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
+OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 nsys profile \
--trace=cuda,nvtx,osrt \
--sample=process-tree --sampling-period=1000000 \
--cpuctxsw=process-tree \
@@ -704,21 +704,21 @@ Current pplx worker CPU-pool separation experiment:
- **Hypothesis**: if part of the steady tail comes from host progress jitter, then avoiding exact CPU overlap between DeepSeek rank workers and pplx a2a workers will reduce p95/max while leaving p50 roughly unchanged.
- **Ceiling estimate**: the new clean log shows an exact conflict: DeepSeek rank worker 3 is pinned to CPU **6**, and pplx a2a worker for cuda:0 is also pinned to CPU **6**. Earlier profiles showed p95/max dominated by host progress/driver wait, so the plausible benefit is tail reduction rather than a 60-80 ms p50 win.
- **Keep/revert criterion**: keep only if local/H200 build passes, H200 logs show no rank-worker/a2a-worker exact CPU overlap, `output_len=64` smoke generates all 64 tokens, and p95 improves by >=20 ms with p50 <= baseline + 5 ms. Revert if p50 regresses beyond 5 ms, p95/max worsens, or affinity selection fails on a constrained CPU mask.
-- **Result**: kept. Local `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 `cargo build --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, and two H200 `output_len=64` serving runs completed all 64 tokens. The new logs show rank workers on CPUs **0/2/4/6/9/11/13/15** and pplx a2a workers on **8/26/50/74/3/27/51/75**, with no exact overlap. Results: `$RESULT_ROOT/pplx_cpu_pool_olen64.log` measured p50 **144.00 ms**, p95 **159.96 ms**, max **164.00 ms**, avg **144.06 ms**; `$RESULT_ROOT/pplx_cpu_pool_olen64_r2.log` measured p50 **144.00 ms**, p95 **162.86 ms**, max **168.01 ms**, avg **145.03 ms**. This passes the tail gate and reduces average TPOT by ~15-16 ms versus `$RESULT_ROOT/pplx_current_olen64.log`; it does not move p50, so the remaining gap is not CPU-overlap tail. Residual risk: the second run printed a teardown-time NCCL abort panic after metrics were emitted; this matches the known shutdown-path instability and is not forward-path evidence.
+- **Result**: kept. Local `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 `cargo build --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, and two H200 `output_len=64` serving runs completed all 64 tokens. The new logs show rank workers on CPUs **0/2/4/6/9/11/13/15** and pplx a2a workers on **8/26/50/74/3/27/51/75**, with no exact overlap. Results: `$RESULT_ROOT/pplx_cpu_pool_olen64.log` measured p50 **144.00 ms**, p95 **159.96 ms**, max **164.00 ms**, avg **144.06 ms**; `$RESULT_ROOT/pplx_cpu_pool_olen64_r2.log` measured p50 **144.00 ms**, p95 **162.86 ms**, max **168.01 ms**, avg **145.03 ms**. This passes the tail gate and reduces average TPOT by ~15-16 ms versus `$RESULT_ROOT/pplx_current_olen64.log`; it does not move p50, so the remaining gap is not CPU-overlap tail. Residual risk: the second run printed a teardown-time NCCL abort panic after metrics were emitted; this matches the known shutdown-path instability and is not forward-path evidence.
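
The "no exact overlap" claim in the result above can be re-checked from the logged CPU lists. The following is a minimal sketch, not part of the repository: `overlapping_cpus` is a hypothetical helper, and the two CPU lists are copied from the result text rather than parsed from the real logs.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the CPUs that appear in both pinning lists.
/// An empty result means the two worker pools are disjoint.
fn overlapping_cpus(rank_workers: &[usize], a2a_workers: &[usize]) -> Vec<usize> {
    let ranks: BTreeSet<usize> = rank_workers.iter().copied().collect();
    let a2a: BTreeSet<usize> = a2a_workers.iter().copied().collect();
    ranks.intersection(&a2a).copied().collect()
}

fn main() {
    // CPU lists copied from the logged placement in the result above.
    let rank_workers = [0, 2, 4, 6, 9, 11, 13, 15];
    let a2a_workers = [8, 26, 50, 74, 3, 27, 51, 75];

    let overlap = overlapping_cpus(&rank_workers, &a2a_workers);
    println!("overlapping CPUs: {:?}", overlap);
}
```

The same helper reports the conflict described in the ceiling estimate when given a pair of lists that both contain CPU 6.
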
Current intra-process route exchange experiment:
- **Target metric**: H200 EP8 `output_len=64` serving p50 should improve by at least **10 ms** over the CPU-pool baseline p50 **144.00 ms**, with p95 staying <= **165 ms** and all 64 tokens generated.
- **Hypothesis**: if p50 floor still includes per-layer fabric route all-gather overhead, then in the single-process single-node case replacing `route_write_op + route_counter.wait` with a process-local barrier plus direct reads of peer `num_routed` mapped host pointers will reduce p50. This should not change dispatch/combine payload semantics.
- **Ceiling estimate**: every MoE layer currently performs route exchange before `process_routing_info`, even though all 8 rank workers live in one process and each rank's `num_routed_host` pointer is directly addressable. The ceiling is one worker transfer submission + immediate wait per layer, so a plausible win is **5-15 ms** p50; it will not close the full 80 ms gap alone.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=64` completes with all tokens, p50 improves by >=10 ms, and p95 stays <=165 ms. Revert on hang, correctness error, p50 regression, or teardown/stop deadlock.
-- **Result**: reverted. Local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4` and `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 build passed, and the short H200 smoke `$RESULT_ROOT/pplx_direct_route_olen8.log` completed with steady avg **141.96 ms**, p50 **143.98 ms**, p95/max **159.08 ms**. The full gate run `$RESULT_ROOT/pplx_direct_route_olen64.log` completed all 64 tokens with first decode **223.66 ms**, steady avg **142.64 ms**, p50 **144.00 ms**, p95 **155.95 ms**, max **160.04 ms**. p95 stayed good but p50 did not move by the required 10 ms, so the code was removed. Post-revert H200 smoke `$RESULT_ROOT/pplx_cpu_pool_restored_olen8.log` completed all 8 tokens with p50 **151.81 ms** and p95/max **164.01 ms**; this is a short correctness smoke, not a new baseline. Mechanism lesson: removing only the route all-gather submission/wait is too small or already overlapped; the p50 floor is in larger per-layer a2a state-machine/device wait work, not this specific route exchange.
+- **Result**: reverted. Local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4` and `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 build passed, and the short H200 smoke `$RESULT_ROOT/pplx_direct_route_olen8.log` completed with steady avg **141.96 ms**, p50 **143.98 ms**, p95/max **159.08 ms**. The full gate run `$RESULT_ROOT/pplx_direct_route_olen64.log` completed all 64 tokens with first decode **223.66 ms**, steady avg **142.64 ms**, p50 **144.00 ms**, p95 **155.95 ms**, max **160.04 ms**. p95 stayed good but p50 did not move by the required 10 ms, so the code was removed. Post-revert H200 smoke `$RESULT_ROOT/pplx_cpu_pool_restored_olen8.log` completed all 8 tokens with p50 **151.81 ms** and p95/max **164.01 ms**; this is a short correctness smoke, not a new baseline. Mechanism lesson: removing only the route all-gather submission/wait is too small or already overlapped; the p50 floor is in larger per-layer a2a state-machine/device wait work, not this specific route exchange.
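
The keep/revert criteria in these experiments all have the same shape: all tokens generated, a minimum p50 gain over the baseline, and a p95 ceiling. A rough sketch of that gate as code, under the assumption that only these three conditions matter; `Tpot` and `passes_gate` are illustrative names, not identifiers from the codebase.

```rust
/// Steady-state TPOT summary in milliseconds (illustrative type).
#[derive(Debug, Clone, Copy)]
struct Tpot {
    p50_ms: f64,
    p95_ms: f64,
}

/// Hypothetical gate: keep a change only if every token was generated,
/// p50 improved by at least `min_p50_gain_ms`, and p95 stayed under the ceiling.
fn passes_gate(
    baseline_p50_ms: f64,
    run: Tpot,
    min_p50_gain_ms: f64,
    p95_ceiling_ms: f64,
    all_tokens_generated: bool,
) -> bool {
    all_tokens_generated
        && (baseline_p50_ms - run.p50_ms) >= min_p50_gain_ms
        && run.p95_ms <= p95_ceiling_ms
}

fn main() {
    // Numbers from the route exchange experiment above:
    // baseline p50 144.00 ms, gate run p50 144.00 ms and p95 155.95 ms.
    let run = Tpot { p50_ms: 144.00, p95_ms: 155.95 };
    let keep = passes_gate(144.00, run, 10.0, 165.0, true);
    println!("run = {:?}, keep = {}", run, keep);
}
```

With the logged values the p95 condition holds but the p50 gain is 0 ms, so the gate evaluates to revert, matching the recorded decision.
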
Current bs=1 pplx capacity clamp experiment:
- **Target metric**: H200 EP8 bs=1 `output_len=64` serving p50 should improve by at least **15 ms** over CPU-pool baseline p50 **144.00 ms**, with p95 <= **165 ms** and all 64 tokens generated.
- **Hypothesis**: if a meaningful part of the 144 ms p50 floor is grouped FP4 work over unused pplx scratch capacity, then clamping pplx decode buffers to the actual bs=1 validation envelope will lower `expanded_input.seq_capacity()` and the grouped FP4 `rows` launch bound enough to move p50. This targets the current GPU-only rows issue without reintroducing per-layer host readback.
- **Ceiling estimate**: current default `max_num_tokens=8` plus upstream private-token formula gives `max_recv_tokens=1376` rows for H200 EP8 (`topk=6`, local experts=32, padding=16). For bs=1, setting `max_num_tokens=1` and `max_private_tokens=topk` gives `max_recv_tokens=560`, a **59%** reduction in the W1/W3 and W2 grouped FP4 row bound. If grouped capacity work is a large part of p50, expected win is **15-30 ms**; if p50 is dominated by a2a state-machine waits, p50 will stay near 144 ms.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=64` generates all 64 tokens, p50 improves by >=15 ms, and p95 stays <=165 ms. Revert on capacity error, illegal address, correctness-looking output failure, p50 regression, or p95 regression. This experiment is explicitly bs=1; it does not claim batch-serving support.
-- **Result**: reverted. Local `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed and H200 build passed. H200 short smoke `$RESULT_ROOT/pplx_bs1_capacity_olen8.log` completed all 8 tokens with first decode **198.61 ms**, steady avg **143.97 ms**, p50 **144.00 ms**, p95/max **156.02 ms**. Full gate `$RESULT_ROOT/pplx_bs1_capacity_olen64.log` completed all 64 tokens with first decode **199.95 ms**, steady avg **143.24 ms**, p50 **144.00 ms**, p95 **155.74 ms**, max **192.25 ms**. The p50 did not move, so the 59% row-bound reduction is not the missing 80 ms. Mechanism lesson: the grouped FP4 capacity overrun may affect small averages, but the p50 floor is dominated by a fixed per-token/per-layer synchronization or worker-state cost outside grouped rows.
+- **Result**: reverted. Local `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed and H200 build passed. H200 short smoke `$RESULT_ROOT/pplx_bs1_capacity_olen8.log` completed all 8 tokens with first decode **198.61 ms**, steady avg **143.97 ms**, p50 **144.00 ms**, p95/max **156.02 ms**. Full gate `$RESULT_ROOT/pplx_bs1_capacity_olen64.log` completed all 64 tokens with first decode **199.95 ms**, steady avg **143.24 ms**, p50 **144.00 ms**, p95 **155.74 ms**, max **192.25 ms**. The p50 did not move, so the 59% row-bound reduction is not the missing 80 ms. Mechanism lesson: the grouped FP4 capacity overrun may affect small averages, but the p50 floor is dominated by a fixed per-token/per-layer synchronization or worker-state cost outside grouped rows.
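
The **59%** figure in the ceiling estimate follows directly from the two `max_recv_tokens` values. A small worked check, assuming only the two row bounds quoted above; it does not reproduce the upstream private-token formula, and `reduction_pct` is an illustrative name.

```rust
/// Percentage reduction when a bound shrinks from `before` to `after`.
/// Assumes `after <= before` and `before > 0`.
fn reduction_pct(before: u32, after: u32) -> f64 {
    (before - after) as f64 / before as f64 * 100.0
}

fn main() {
    // max_recv_tokens for the default config vs. the bs=1 clamp, as quoted above.
    let default_rows = 1376;
    let clamped_rows = 560;
    println!(
        "grouped FP4 row bound reduction: {:.1}%",
        reduction_pct(default_rows, clamped_rows)
    );
}
```
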
Current local CUDA Graph island experiment:
- **Target metric**: H200 EP8 bs=1 pplx decode steady TPOT p50 reduced by at least **10 ms** over at least **16 decode steps**; no correctness regression or CUDA graph capture failure.
@@ -727,13 +727,13 @@ Current local CUDA Graph island experiment:
- **Keep/revert criterion**: keep only if local build passes, H200 build passes, bs=1 correctness smoke completes, and either steady p50 improves by >=10 ms or decode-only profile proves a large API-call reduction without new instability. Revert if graph capture fails, graph replay produces stale-token behavior, or p50 moves less than 5 ms with no clear API reduction.
- **Result**: local build passed, H200 build passed, and bs=1 smoke completed. H200 `output_len=24` non-profile run completed with `prefill_ms=576.78`, `first_decode_step_ms=491.85`, steady TPOT avg **159.81 ms**, p50 **154.78 ms**, p95 **214.96 ms**, max **292.08 ms**, samples **22** (`$RESULT_ROOT/pplx_local_graph_olen24.log`). This is worse than the prior ratio4 batch topk profile class (`output_len=16`, avg **152.85 ms**, p50 **144.03 ms**, p95 **188.01 ms**), so the wall-clock gate failed.
- **Profile result**: decode-only nsys profile `$RESULT_ROOT/pplx_local_graph_profile_olen12.{log,nsys-rep,sqlite}` captured 11 request steps / 88 rank decode ranges. It recorded **1056** `cuGraphInstantiateWithFlags` calls totaling **944.08 ms**, exactly matching 132 graph islands × 8 ranks. Replay added **23232** `cuGraphLaunch` calls totaling **421.25 ms**. Normalized by request step, `cudaLaunchKernel` fell from **17312** calls/step in `$RESULT_ROOT/pplx_driver_contention_olen8.sqlite` to **13827** calls/step, but adding `cuGraphLaunch` gives **15939** launch-class calls/step, only about **8%** below the old profile. `cuEventRecord` stayed **2752** calls/step, and `cuStreamWaitEvent` stayed in the same range (**2378 -> 2418** calls/step). Mechanism lesson: graph islands this small do remove some kernel launches, but they do not remove the explicit stream handoffs and they replace many launches with graph launches; the effective unit must be a larger operator island or a generated static decode block, not per-helper graphlets.
-- **Cleanup result**: fine-grained graph island state/wrappers were removed from `state.rs`, `worker.rs`, and `block.rs`; NVTX instrumentation stayed. Local validation passed: `cargo fmt -p pegainfer-deepseek-v4`, `cargo check --release -p pegainfer-deepseek-v4 --features pplx-ep --bin deepseek_pplx_a2a_bench`, and `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`. H200 validation passed: `cargo build --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, then `PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 16 --warmup 0 --iters 1` completed with `prefill_ms=617.40`, `first_decode_ms=239.31`, steady TPOT avg **158.84 ms**, p50 **143.99 ms**, p95 **184.00 ms**, max **255.91 ms**, samples **14**; log `$RESULT_ROOT/pplx_no_graph_islands_olen16.log`.
+- **Cleanup result**: fine-grained graph island state/wrappers were removed from `state.rs`, `worker.rs`, and `block.rs`; NVTX instrumentation stayed. Local validation passed: `cargo fmt -p openinfer-deepseek-v4`, `cargo check --release -p openinfer-deepseek-v4 --features pplx-ep --bin deepseek_pplx_a2a_bench`, and `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`. H200 validation passed: `cargo build --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, then `OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 16 --warmup 0 --iters 1` completed with `prefill_ms=617.40`, `first_decode_ms=239.31`, steady TPOT avg **158.84 ms**, p50 **143.99 ms**, p95 **184.00 ms**, max **255.91 ms**, samples **14**; log `$RESULT_ROOT/pplx_no_graph_islands_olen16.log`.
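
The per-step normalization in the profile result can be reproduced from the quoted counts. This is a back-of-the-envelope sketch using only numbers stated above (11 request steps, 132 graph islands, 8 ranks); the function names are illustrative.

```rust
/// Launch-class calls per request step: kernel launches per step plus
/// graph launches averaged over the captured request steps.
fn launch_class_per_step(kernel_per_step: u64, graph_launch_total: u64, steps: u64) -> u64 {
    kernel_per_step + graph_launch_total / steps
}

/// How far `new` sits below `old`, in percent.
fn pct_below(old: u64, new: u64) -> f64 {
    (old - new) as f64 / old as f64 * 100.0
}

fn main() {
    // One instantiate per graph island per rank.
    let instantiates = 132 * 8;
    // cudaLaunchKernel per step after the change, plus cuGraphLaunch per step.
    let with_graphs = launch_class_per_step(13827, 23232, 11);
    // Old profile: cudaLaunchKernel calls per step without graph islands.
    let old_per_step = 17312;

    println!("cuGraphInstantiateWithFlags calls: {}", instantiates);
    println!("launch-class calls/step: {}", with_graphs);
    println!("below old profile by {:.1}%", pct_below(old_per_step, with_graphs));
}
```

The graph launches add back 2112 calls per step, which is why the net reduction is only about 8% even though `cudaLaunchKernel` alone dropped by roughly 20%.
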
Current pplx worker-wait decomposition profile:
- **Target metric**: H200 EP8 `output_len=64` NVTX-only profile should explain the PPLX non-rank0 p50 lane (**~74 ms**) by named worker waits, with at least **62 steady samples**. This is diagnostic; it does not decide keep/revert for a perf code path.
- **Hypothesis**: if the 74ms non-rank0 lane is the per-layer PPLX worker state machine rather than model compute or raw payload transfer, then `p2p_all_to_all` p50 should be near `74ms / 43 layers ~= 1.7ms`, and one or two named waits should account for most of that per-layer p50.
- **Ceiling estimate**: eliminating 1ms/layer of worker wait has a direct ceiling of **~43 ms/token** on non-rank0 lanes, and rank0/wait-rank should follow because `logits_dtoh` is the final drain.
-- **Result**: confirmed. Instrumented only `WorkerState::step()` NVTX waits in `pegainfer-comm-p2p-all-to-all/src/a2a_worker.rs`; local `cargo fmt -p pegainfer-comm` passed, local `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 release build passed. H200 profile `$RESULT_ROOT/pplx_worker_wait_nvtx_olen64.{log,sqlite,nsys-rep}` completed all 64 tokens: first decode **211.51 ms**, steady TPOT p50 **144.00 ms**, p95 **159.92 ms**, max **164.12 ms**, samples **62**.
+- **Result**: confirmed. Instrumented only `WorkerState::step()` NVTX waits in `openinfer-comm-p2p-all-to-all/src/a2a_worker.rs`; local `cargo fmt -p openinfer-comm` passed, local `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed, H200 release build passed. H200 profile `$RESULT_ROOT/pplx_worker_wait_nvtx_olen64.{log,sqlite,nsys-rep}` completed all 64 tokens: first decode **211.51 ms**, steady TPOT p50 **144.00 ms**, p95 **159.92 ms**, max **164.12 ms**, samples **62**.
- **Worker evidence**:
- `p2p_all_to_all`: count **21680**, p50 **1.609 ms**, p95 **16.720 ms**, avg **3.951 ms**; count matches roughly `63 decode steps * 8 ranks * 43 MoE layers`.
- `worker_wait_combine_recv_done`: count **21672**, p50 **1.111 ms**, p95 **1.175 ms**, p99 **1.191 ms**. This is the stable per-layer floor.
@@ -749,14 +749,14 @@ Current single-node combine-recv grid clamp experiment:
- **Hypothesis**: if the stable 1.111 ms/layer combine floor comes from launching `a2a_combine_recv_kernel` as an SM-count cooperative grid even when `num_tokens=1`, then single-node `combine_recv` can launch only `min(num_tokens, num_sms)` blocks without changing output semantics, reducing the worker's `combine_recv_done` wait and the non-rank0 lane.
- **Ceiling estimate**: `worker_wait_combine_recv_done` p50 **1.111 ms/layer * 43 layers ~= 47.8 ms/token**. Even a 50% reduction is enough to meet the **20 ms** p50 gate.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=64` generates all tokens, p50 improves by >=20 ms, p95 <=165 ms, and a follow-up worker-wait profile confirms `worker_wait_combine_recv_done` p50 <=0.5 ms/layer. Revert on hang, CUDA illegal address, wrong-looking output failure, p50 regression, or p95 regression.
-- **Result**: reverted. Local `cargo fmt -p pegainfer-comm` passed and local `cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 build passed. H200 gate `$RESULT_ROOT/pplx_combine_recv_grid_olen64.log` generated all 64 tokens but measured first decode **195.83 ms**, steady TPOT avg **146.26 ms**, p50 **144.00 ms**, p95 **164.00 ms**, max **187.52 ms**, samples **62**. The process then hit the known teardown segfault, but metrics were already emitted; the forward gate failed because p50 did not improve. Mechanism lesson: the 1.111 ms/layer `worker_wait_combine_recv_done` floor is not fixed by reducing `a2a_combine_recv_kernel` from SM-count blocks to `num_tokens` blocks. The cost is more likely in the flag/worker completion protocol or combine-send/recv dependency chain than in empty cooperative-grid block count alone.
+- **Result**: reverted. Local `cargo fmt -p openinfer-comm` passed and local `cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 build passed. H200 gate `$RESULT_ROOT/pplx_combine_recv_grid_olen64.log` generated all 64 tokens but measured first decode **195.83 ms**, steady TPOT avg **146.26 ms**, p50 **144.00 ms**, p95 **164.00 ms**, max **187.52 ms**, samples **62**. The process then hit the known teardown segfault, but metrics were already emitted; the forward gate failed because p50 did not improve. Mechanism lesson: the 1.111 ms/layer `worker_wait_combine_recv_done` floor is not fixed by reducing `a2a_combine_recv_kernel` from SM-count blocks to `num_tokens` blocks. The cost is more likely in the flag/worker completion protocol or combine-send/recv dependency chain than in empty cooperative-grid block count alone.
Current single-node combine-recv host-flag skip experiment:
- **Target metric**: H200 EP8 `output_len=64` serving completes all 64 tokens; steady TPOT p50 improves by at least **20 ms** versus **144.00 ms**, p95 stays <= **165 ms**. Follow-up worker-wait profile should show `worker_wait_combine_recv_done` p50 below **0.5 ms/layer**.
- **Hypothesis**: if the stable `worker_wait_combine_recv_done` p50 comes from `a2a_combine_recv_kernel` polling a host-set GDR flag, then single-node (`world_size == node_size`) can skip `combine_recv_flag` because there are no fabric combine payloads; same-stream ordering plus `sync_ptrs` already protect local NVLink combine copies. This should remove the per-layer host→GPU flag latency without changing cross-node behavior.
- **Ceiling estimate**: CUDA+NVTX profile `$RESULT_ROOT/pplx_cuda_nvtx_olen8.sqlite` captured `a2a_combine_recv_kernel` p50 **179 us** but worker `combine_recv_done` wait p50 **1.113 ms**, leaving roughly **0.9 ms/layer** unexplained by kernel compute. That is **~38 ms/token** across 43 MoE layers.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=64` generates all tokens, p50 improves by >=20 ms, p95 <=165 ms, and worker-wait profile confirms the `combine_recv_done` floor moved. Revert on hang, CUDA illegal address, wrong-looking output failure, p50 regression, or p95 regression.
-- **Result**: reverted. Local build with `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 gate `$RESULT_ROOT/pplx_combine_recv_skip_host_flag_olen64.log` completed with exit status 0 and generated all 64 tokens, but measured first decode **199.75 ms**, steady TPOT avg **144.71 ms**, p50 **144.00 ms**, p95 **160.00 ms**, max **164.00 ms**, samples **62**. The p50 did not move, so skipping the host flag is not enough. Mechanism lesson: the `combine_recv_done` wait range is not explained by either empty cooperative-grid blocks or the `combine_recv_flag` MMIO poll in isolation; the remaining floor is likely the broader same-stream combine-send/recv dependency plus worker state-machine cadence.
+- **Result**: reverted. Local build with `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 gate `$RESULT_ROOT/pplx_combine_recv_skip_host_flag_olen64.log` completed with exit status 0 and generated all 64 tokens, but measured first decode **199.75 ms**, steady TPOT avg **144.71 ms**, p50 **144.00 ms**, p95 **160.00 ms**, max **164.00 ms**, samples **62**. The p50 did not move, so skipping the host flag is not enough. Mechanism lesson: the `combine_recv_done` wait range is not explained by either empty cooperative-grid blocks or the `combine_recv_flag` MMIO poll in isolation; the remaining floor is likely the broader same-stream combine-send/recv dependency plus worker state-machine cadence.
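
The ceiling estimates in the two combine-recv experiments above scale a per-layer wait by the 43 MoE layers. A minimal sketch of that arithmetic, assuming the per-layer cost is identical on every layer; `per_token_ms` is an illustrative name.

```rust
/// Per-token cost if every MoE layer pays the same per-layer cost.
fn per_token_ms(per_layer_ms: f64, layers: u32) -> f64 {
    per_layer_ms * layers as f64
}

fn main() {
    let layers = 43;

    // Full combine_recv_done wait floor from the worker-wait profile.
    let wait_floor = per_token_ms(1.111, layers);

    // Part of the wait not explained by kernel compute:
    // worker wait p50 (1.113 ms) minus kernel p50 (179 us).
    let unexplained_per_layer = 1.113 - 0.179;
    let unexplained = per_token_ms(unexplained_per_layer, layers);

    println!("combine_recv_done floor: {:.1} ms/token", wait_floor);
    println!(
        "unexplained wait: {:.3} ms/layer, {:.1} ms/token",
        unexplained_per_layer, unexplained
    );
}
```

Using the unrounded 0.934 ms/layer gives about 40 ms/token; the quoted ~38 ms/token corresponds to rounding the per-layer gap down to 0.9 ms before multiplying. Either way the estimate is well above the **20 ms** gate.
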
Current a2a device wait-counter profile:
- **Target metric**: H200 short run should emit device-side wait counters for all four a2a kernels at shutdown, especially `combine_recv recv_flag_avg_cycles` and `combine_recv nvlink_sum_cycles`. This is diagnostic only.
@@ -777,7 +777,7 @@ Current single-node active-source combine mask experiment:
- **Hypothesis**: if the previous source-specific hang came from kernel-side source inference rather than the protocol itself, then a worker-derived exact active-source mask should avoid the hang and reduce the `sync_ptrs[local_rank][peer + NODE_SIZE]` all-peer wait.
- **Ceiling estimate**: same as the previous source-specific attempt: the all-peer `combine_recv nvlink_sum_cycles` counter is large enough that removing non-source peers could plausibly exceed the **20 ms/token** gate if the protocol allowed it.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=64` generates all tokens, p50 improves by >=20 ms, and p95 <=165 ms. Revert on hang, illegal address, wrong-looking output failure, p50 regression, or p95 regression.
-- **Result**: reverted. The implementation used `num_recv_tokens[2]` as a GDR-visible active-source mask, updated the C++/Rust FFI signature, and made warp1 in `a2a_combine_recv_kernel` wait only mask lanes. Local `cargo fmt -p pegainfer-comm` passed and local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed, but `$RESULT_ROOT/pplx_active_source_mask_olen64.log` timed out with status **124** before any forward metric. The experiment was removed locally and remotely; restored H200 build passed and `$RESULT_ROOT/pplx_restored_after_mask_olen8.log` completed with 8 tokens, steady p50 **143.96 ms**. Mechanism lesson: even an exact active-source wait set is not a safe local change. The all-peer combine sync is part of a larger bidirectional buffer-reuse/state-machine protocol, not a pure data-dependency wait.
+- **Result**: reverted. The implementation used `num_recv_tokens[2]` as a GDR-visible active-source mask, updated the C++/Rust FFI signature, and made warp1 in `a2a_combine_recv_kernel` wait only mask lanes. Local `cargo fmt -p openinfer-comm` passed and local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed, but `$RESULT_ROOT/pplx_active_source_mask_olen64.log` timed out with status **124** before any forward metric. The experiment was removed locally and remotely; restored H200 build passed and `$RESULT_ROOT/pplx_restored_after_mask_olen8.log` completed with 8 tokens, steady p50 **143.96 ms**. Mechanism lesson: even an exact active-source wait set is not a safe local change. The all-peer combine sync is part of a larger bidirectional buffer-reuse/state-machine protocol, not a pure data-dependency wait.
Direct-combine feasibility probe:
- **Target metric**: determine whether a future direct-combine prototype can reuse ordinary `CudaSlice` pointers (`expert_out.data`) through CUDA peer access, or whether `expert_out` must be reallocated as bootstrap-managed CUMem.
@@ -789,14 +789,14 @@ Current direct-combine prototype:
- **Target metric**: H200 EP8 `output_len=64` serving completes all 64 tokens; steady TPOT p50 improves by at least **20 ms** versus **144.00 ms**, p95 stays <= **165 ms**. Short `output_len=8` smoke must generate all 8 tokens before the long gate.
- **Hypothesis**: if the 1.111 ms/layer combine-completion floor comes from copying routed expert output through legacy `combine_send -> recv_buffer -> combine_recv`, then a single-node direct-combine kernel can publish local `expert_out` readiness, wait for peer readiness, compute the same padded source index from `indices/token_offset/num_routed`, and reduce directly from peer `expert_out` pointers. This should remove one legacy combine-send payload/copy stage and materially lower the non-rank0 lane.
- **Ceiling estimate**: `worker_wait_combine_recv_done` p50 is **1.111 ms/layer**, or **~47.8 ms/token** across 43 layers. A direct-combine path only needs to recover ~40% of that floor to pass the **20 ms** p50 gate.
-- **Implementation result**: local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4` passed. Local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed after syncing the direct kernel, cxx FFI, `AllToAllContext::direct_combine_recv`, `EpBackend::direct_combine_recv`, the peer `expert_out` pointer table, and the `moe_pplx.rs` call site.
+- **Implementation result**: local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4` passed. Local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed after syncing the direct kernel, cxx FFI, `AllToAllContext::direct_combine_recv`, `EpBackend::direct_combine_recv`, the peer `expert_out` pointer table, and the `moe_pplx.rs` call site.
- **Gate result**: failed and disabled. Initial H200 `$RESULT_ROOT/pplx_direct_combine_olen8.log` timed out with status **124** after benchmark start. Metadata-only direct-combine also timed out until the direct kernel waited for all peer first-half flags before publishing its second-half ready flag; after that protocol fix, `$RESULT_ROOT/pplx_direct_metadata_waitfix_olen8.log` generated all 8 tokens with steady p50 **152.00 ms**. Full direct then localized CUDA 700 to `direct_combine_recv` under `CUDA_LAUNCH_BLOCKING=1` (`$RESULT_ROOT/pplx_direct_full_waitfix_lblock_olen2.log`). Enabling both CUDA peer access and default-mempool peer access fixed the illegal address: `$RESULT_ROOT/pplx_direct_full_mempool_lblock_olen2.log` completed 2 tokens, and `$RESULT_ROOT/pplx_direct_full_mempool_olen8.log` completed 8 tokens with steady avg **141.30 ms**, p50 **143.99 ms**, p95/max **144.04 ms**. The required long gate `$RESULT_ROOT/pplx_direct_full_mempool_olen64.log` did not emit request metrics before the process ended, so the gate failed. The hot path is now hard-disabled with `USE_SINGLE_NODE_DIRECT_COMBINE=false` while the compiled prototype remains dormant. Restored H200 release build passed, and `$RESULT_ROOT/pplx_direct_false_mempool_olen64.log` generated all 64 tokens with first decode **178.04 ms**, steady avg **146.13 ms**, p50 **144.00 ms**, p95 **160.00 ms**, max **164.02 ms**, samples **62**; the process then hit the known teardown-time NCCL abort with status **134** after metrics were printed.
- **Mechanism lesson**: direct peer pointer addressability is solvable, and the first protocol deadlock was specifically an early overwrite of the second-half sync slots before lagging peers consumed the previous value. But replacing only `combine_send + combine_recv` inside the legacy worker step does not prove a p50 win: short full-direct p50 stayed **143.99 ms**, and the long run produced no gate metric. The next version needs a distinct single-node worker mode that removes the legacy combine stage/barrier from the state machine, instead of dropping a GPU-only data path behind the same worker cadence.
Single-node direct worker mode experiment:
- **Target metric**: H200 EP8 `output_len=64` serving completes all 64 tokens; steady TPOT p50 improves by at least **20 ms** versus **144.00 ms**, p95 stays <= **165 ms**. Worker-wait NVTX should show `worker_wait_combine_recv_done` p50 below **0.5 ms/layer** without simply moving the same wait into another range.
- **Change**: added an explicit `single_node_direct_combine_enabled` mode on `WorkerState`, exposed through `AllToAllContext` and `EpBackend`. `moe_pplx.rs` sets the mode before `dispatch_send` when the direct-combine branch is active. In that mode the worker keeps route/dispatch processing but skips the dispatch and combine fabric barriers, waits for the direct kernel's `combine_send_done`/`combine_recv_done`, then releases `tx_ready` directly. This isolates the single-node direct path from the legacy barrier cadence without changing the legacy combine path.
-- **Result**: failed and disabled. Local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4` passed and local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed. `$RESULT_ROOT/pplx_direct_worker_mode_olen8.log` completed with status 0, steady avg **145.30 ms**, p50 **144.01 ms**, p95/max **151.61 ms**. `$RESULT_ROOT/pplx_direct_worker_mode_olen64.log` generated all 64 tokens, first decode **239.77 ms**, steady avg **147.61 ms**, p50 **144.00 ms**, p95 **164.00 ms**, max **176.21 ms**, samples **62**, then hit known teardown status **134** after metrics. The p50 gate failed, so `USE_SINGLE_NODE_DIRECT_COMBINE` is back to false.
+- **Result**: failed and disabled. Local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4` passed and local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed. H200 release build passed. `$RESULT_ROOT/pplx_direct_worker_mode_olen8.log` completed with status 0, steady avg **145.30 ms**, p50 **144.01 ms**, p95/max **151.61 ms**. `$RESULT_ROOT/pplx_direct_worker_mode_olen64.log` generated all 64 tokens, first decode **239.77 ms**, steady avg **147.61 ms**, p50 **144.00 ms**, p95 **164.00 ms**, max **176.21 ms**, samples **62**, then hit known teardown status **134** after metrics. The p50 gate failed, so `USE_SINGLE_NODE_DIRECT_COMBINE` is back to false.
- **Profile result**: `$RESULT_ROOT/pplx_direct_worker_mode_nvtx_olen16.{log,sqlite,nsys-rep}` confirms the new mode really skipped the hot-path barriers: `barrier` ranges dropped from **43344** in `$RESULT_ROOT/pplx_worker_wait_nvtx_olen64.sqlite` to **16** in the direct-worker profile. The wait moved rather than disappearing. Baseline worker p50s were `worker_wait_combine_send_done` **0.003 ms** and `worker_wait_combine_recv_done` **1.111 ms**; direct-worker mode changed them to **0.970 ms** and **0.224 ms** respectively, while `p2p_all_to_all` p50 stayed roughly flat at **1.669 ms** versus the baseline **1.609 ms**. This means the worker now waits earlier for grouped-GEMM/direct-kernel readiness rather than later for combine-recv completion.
- **Restore validation**: after disabling direct again, H200 release build passed and `$RESULT_ROOT/pplx_direct_mode_restored_olen64.log` generated all 64 tokens with first decode **199.80 ms**, steady avg **144.84 ms**, p50 **144.00 ms**, p95 **156.01 ms**, max **168.22 ms**, samples **62**. It then hit the known teardown segfault after metrics.
- **Mechanism lesson**: removing legacy barriers around direct-combine is insufficient because the per-layer p50 budget is not only barrier/worker completion overhead. In the direct path, the worker reaches combine stage before local expert output is ready and waits for `combine_send_done`, so the same per-layer lane time remains visible. The next p50 attempt should stop treating `p2p_all_to_all` duration as pure communication overhead and instead correlate `worker_wait_combine_send_done` with grouped FP4/local MoE compute and stream/event handoff. A correct optimization now needs either a cheaper local expert path / fewer grouped rows, or a schedule where the worker is not the serialized lane owner for local expert readiness.
@@ -806,7 +806,7 @@ Single-node direct worker early-release experiment:
- **Hypothesis**: if direct-worker mode failed because the worker still waited for local expert output readiness (`worker_wait_combine_send_done` p50 **0.970 ms/layer**), then in the direct-combine path the worker can release `tx_ready` immediately after `dispatch_recv_done` and return to the next step. The direct combine kernel remains ordered on `moe_stream` before the next layer's `dispatch_send`, and it owns `sync_counter`/`sync_ptrs` completion without the worker spinning on expert readiness.
- **Ceiling estimate**: direct-worker-mode `p2p_all_to_all` p50 was **1.669 ms/layer** and `worker_wait_combine_send_done` p50 was **0.970 ms/layer**. Removing that worker wait has a theoretical ceiling of **~41.7 ms/token** across 43 MoE layers; even half of that passes the **20 ms** p50 gate.
- **Keep/revert criterion**: keep only if local/H200 builds pass, H200 `output_len=8` smoke generates all tokens, H200 `output_len=64` generates all tokens with p50 <= **124 ms** and p95 <= **165 ms**, and a follow-up worker-wait profile confirms `p2p_all_to_all` p50 <= **1.1 ms/layer**. Revert on hang, CUDA illegal address, stale sync/counter behavior, p50 regression, p95 regression, or teardown/stop deadlock before metrics.
-- **Result**: reverted. The change set `USE_SINGLE_NODE_DIRECT_COMBINE=true` and made direct mode release `tx_ready` immediately after `dispatch_recv_done`, before returning from `WorkerState::step()`. Local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4` passed; local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 short smoke `$RESULT_ROOT/pplx_direct_worker_early_release_olen8.log` generated 8 tokens with steady avg **146.73 ms**, p50 **144.00 ms**, p95/max **164.07 ms**, then hit the known teardown segfault after metrics. The full gate `$RESULT_ROOT/pplx_direct_worker_early_release_olen64.log` timed out with status **124** before any forward metric. The experiment was removed and `USE_SINGLE_NODE_DIRECT_COMBINE=false` restored; H200 release build passed and `$RESULT_ROOT/pplx_after_early_release_revert_olen8.log` generated 8 tokens with steady avg **144.30 ms**, p50 **144.03 ms**, p95/max **151.98 ms**.
+- **Result**: reverted. The change set `USE_SINGLE_NODE_DIRECT_COMBINE=true` and made direct mode release `tx_ready` immediately after `dispatch_recv_done`, before returning from `WorkerState::step()`. Local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4` passed; local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 short smoke `$RESULT_ROOT/pplx_direct_worker_early_release_olen8.log` generated 8 tokens with steady avg **146.73 ms**, p50 **144.00 ms**, p95/max **164.07 ms**, then hit the known teardown segfault after metrics. The full gate `$RESULT_ROOT/pplx_direct_worker_early_release_olen64.log` timed out with status **124** before any forward metric. The experiment was removed and `USE_SINGLE_NODE_DIRECT_COMBINE=false` restored; H200 release build passed and `$RESULT_ROOT/pplx_after_early_release_revert_olen8.log` generated 8 tokens with steady avg **144.30 ms**, p50 **144.03 ms**, p95/max **151.98 ms**.
- **Mechanism lesson**: direct-combine completion cannot be detached from the worker simply by releasing `tx_ready` after dispatch. Short runs can survive, but longer decode eventually wedges, which means the worker still owns part of the per-layer lifetime beyond send-buffer reuse. The next state-machine change needs an explicit completion acknowledgment or a persistent GPU progress design; early host release alone is not a safe scheduling model.
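
The ceiling estimate for this experiment is plain arithmetic over the per-layer worker wait. A minimal sketch follows. The helper names are hypothetical and not part of the repo, and it assumes the wait is paid once per MoE layer with no overlap against other work:

```rust
/// Upper bound on per-token savings if a per-layer wait were removed entirely.
/// Assumption: the wait is serialized once per MoE layer and overlaps nothing.
fn per_token_ceiling_ms(wait_ms_per_layer: f64, moe_layers: u32) -> f64 {
    wait_ms_per_layer * f64::from(moe_layers)
}

/// True when the ceiling, scaled by the fraction we expect to realize,
/// still clears the p50 improvement gate.
fn clears_gate(ceiling_ms: f64, realized_fraction: f64, gate_ms: f64) -> bool {
    ceiling_ms * realized_fraction >= gate_ms
}
```

With the numbers above, `per_token_ceiling_ms(0.970, 43)` gives about 41.7 ms/token, and `clears_gate(41.71, 0.5, 20.0)` holds. The estimate is only a bound: the timeout on the `output_len=64` gate shows that a wait can be removable on paper and still be load-bearing for correctness.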
PPLX routed-MoE ceiling experiment:
@@ -814,18 +814,18 @@ PPLX routed-MoE ceiling experiment:
- **Hypothesis**: if the 144 ms p50 is dominated by PPLX routed-MoE composition, this fake shared-only run should drop near the NCCL 60 ms class. If it stays far above 100 ms, the next bottleneck is outside routed MoE and PPLX state-machine work has a lower ceiling.
- **Ceiling estimate**: current PPLX p50 **144 ms** vs NCCL p50 **63 ms** leaves **~81 ms/token**. Removing all routed-MoE PPLX work is the maximum possible PPLX-side win; any real implementation has a lower ceiling.
- **Keep/revert criterion**: never keep the code path. Record the number, then restore the real routed-MoE path and verify local build health.
-- **Result**: reverted. The first remote run accidentally synced `moe_pplx.rs` to the repository root and reproduced the baseline p50 **144.00 ms**; the real source was then synced to `pegainfer-deepseek-v4/src/runtime/moe_pplx.rs` and grep verified `PPLX_SHARED_ONLY_CEILING=true`. H200 `$RESULT_ROOT/pplx_shared_only_ceiling_real_olen64.log` generated all 64 tokens, then hit the known teardown segfault after metrics. Metrics: first decode **30.31 ms**, steady TPOT avg **21.69 ms**, p50 **21.84 ms**, p95 **24.27 ms**, max **25.74 ms**, samples **62**. The output is intentionally invalid, but the performance bound is decisive: removing routed MoE/PPLX work drops far below the NCCL p50 target, so the remaining 144 ms p50 is overwhelmingly in routed-MoE/PPLX composition. The code was restored locally and remotely; local release check passed, remote grep confirmed `PPLX_SHARED_ONLY_CEILING` is absent, remote release build passed, and `$RESULT_ROOT/pplx_restored_after_shared_ceiling_olen8.log` returned to the real path with steady p50 **144.03 ms**. The useful optimization direction is a real single-node routed path that avoids the current four PPLX cooperative kernels plus worker state-machine cadence, not attention/sampling/logits work.
+- **Result**: reverted. The first remote run accidentally synced `moe_pplx.rs` to the repository root and reproduced the baseline p50 **144.00 ms**; the real source was then synced to `openinfer-deepseek-v4/src/runtime/moe_pplx.rs` and grep verified `PPLX_SHARED_ONLY_CEILING=true`. H200 `$RESULT_ROOT/pplx_shared_only_ceiling_real_olen64.log` generated all 64 tokens, then hit the known teardown segfault after metrics. Metrics: first decode **30.31 ms**, steady TPOT avg **21.69 ms**, p50 **21.84 ms**, p95 **24.27 ms**, max **25.74 ms**, samples **62**. The output is intentionally invalid, but the performance bound is decisive: removing routed MoE/PPLX work drops far below the NCCL p50 target, so the remaining 144 ms p50 is overwhelmingly in routed-MoE/PPLX composition. The code was restored locally and remotely; local release check passed, remote grep confirmed `PPLX_SHARED_ONLY_CEILING` is absent, remote release build passed, and `$RESULT_ROOT/pplx_restored_after_shared_ceiling_olen8.log` returned to the real path with steady p50 **144.03 ms**. The useful optimization direction is a real single-node routed path that avoids the current four PPLX cooperative kernels plus worker state-machine cadence, not attention/sampling/logits work.
Single-node peer-memory routed path groundwork:
- **Target metric**: behavior-preserving setup change only. Existing PPLX `output_len=8` smoke should still generate all tokens and stay in the real-path p50 **144 ms** class. This patch does not claim a TPOT win.
- **Change**: `EnablePplx` now returns a `PplxPeerScratchPtrs` bundle instead of only `expert_out`. Each rank installs peer pointer tables for `expert_out`, `expanded_input`, `recv_tokens_per_expert`, `expert_indptr`, and the EP backend's full `num_routed` table into `MoePplxScratch`. Existing direct-combine keeps consuming `peer_expert_out_ptrs`; the new pointer tables are dormant until a direct-dispatch kernel lands.
- **Why**: a correct single-node direct dispatch should run on sender ranks, read local `input + route_indices`, and write directly into peer `expanded_input` plus peer per-expert counters. That requires persistent peer destination pointers; trying to ask for peer input pointers per layer would fight the rank-worker ownership model.
-- **Validation**: local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, and `git diff --check` passed. H200 release build passed. H200 `$RESULT_ROOT/pplx_peer_ptr_tables_olen8.log` generated all 8 tokens on the real path: first decode **255.71 ms**, steady avg **143.96 ms**, p50 **144.01 ms**, p95/max **144.02 ms**, then hit the known teardown segfault after metrics. After adding the `num_routed` table, H200 `$RESULT_ROOT/pplx_peer_num_routed_tables_olen8.log` generated all 8 tokens with first decode **219.85 ms**, steady avg **142.31 ms**, p50 **144.00 ms**, p95/max **146.10 ms**, then hit the same known teardown segfault after metrics. This validates the peer pointer table expansion as behavior-preserving groundwork.
+- **Validation**: local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, and `git diff --check` passed. H200 release build passed. H200 `$RESULT_ROOT/pplx_peer_ptr_tables_olen8.log` generated all 8 tokens on the real path: first decode **255.71 ms**, steady avg **143.96 ms**, p50 **144.01 ms**, p95/max **144.02 ms**, then hit the known teardown segfault after metrics. After adding the `num_routed` table, H200 `$RESULT_ROOT/pplx_peer_num_routed_tables_olen8.log` generated all 8 tokens with first decode **219.85 ms**, steady avg **142.31 ms**, p50 **144.00 ms**, p95/max **146.10 ms**, then hit the same known teardown segfault after metrics. This validates the peer pointer table expansion as behavior-preserving groundwork.
Single-node peer-memory direct routed experiment:
- **Target metric**: H200 `output_len=64` must generate all 64 tokens and beat the pre-written gate: steady p50 <= **124 ms** and p95 <= **165 ms**. This is a correctness-path experiment, unlike the fake shared-only ceiling run.
-- **Change**: added `a2a_direct_dispatch` to `pegainfer-comm-a2a-kernels`, exposed it through `AllToAllContext::direct_dispatch` and `EpBackend::direct_dispatch`, and hard-coded `USE_SINGLE_NODE_DIRECT_ROUTED=true` in `moe_pplx.rs`. The kernel runs on sender ranks, counts local routes, writes each source row into every peer's `num_routed` table, builds the destination rank's `recv_tokens_per_expert` and padded `expert_indptr`, writes routed BF16 activations directly into peer `expanded_input`, and advances the existing `sync_counter/sync_ptrs` protocol so the existing direct-combine kernel can read peer `expert_out` by the same `base + source-prefix + token_offset` formula.
-- **Validation**: local `cargo fmt -p pegainfer-comm -p pegainfer-deepseek-v4`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, and `git diff --check` passed. H200 release build passed. H200 `$RESULT_ROOT/pplx_direct_routed_olen8.log` generated all 8 tokens: first decode **140.19 ms**, steady avg **87.89 ms**, p50 **89.32 ms**, p95/max **92.17 ms**, then hit the known teardown segfault after metrics. H200 `$RESULT_ROOT/pplx_direct_routed_olen64.log` generated all 64 tokens: first decode **152.45 ms**, steady avg **86.05 ms**, p50 **83.94 ms**, p95 **94.12 ms**, p99 **103.54 ms**, max **107.80 ms**, then hit the same teardown segfault after metrics.
+- **Change**: added `a2a_direct_dispatch` to `openinfer-comm-a2a-kernels`, exposed it through `AllToAllContext::direct_dispatch` and `EpBackend::direct_dispatch`, and hard-coded `USE_SINGLE_NODE_DIRECT_ROUTED=true` in `moe_pplx.rs`. The kernel runs on sender ranks, counts local routes, writes each source row into every peer's `num_routed` table, builds the destination rank's `recv_tokens_per_expert` and padded `expert_indptr`, writes routed BF16 activations directly into peer `expanded_input`, and advances the existing `sync_counter/sync_ptrs` protocol so the existing direct-combine kernel can read peer `expert_out` by the same `base + source-prefix + token_offset` formula.
+- **Validation**: local `cargo fmt -p openinfer-comm -p openinfer-deepseek-v4`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, and `git diff --check` passed. H200 release build passed. H200 `$RESULT_ROOT/pplx_direct_routed_olen8.log` generated all 8 tokens: first decode **140.19 ms**, steady avg **87.89 ms**, p50 **89.32 ms**, p95/max **92.17 ms**, then hit the known teardown segfault after metrics. H200 `$RESULT_ROOT/pplx_direct_routed_olen64.log` generated all 64 tokens: first decode **152.45 ms**, steady avg **86.05 ms**, p50 **83.94 ms**, p95 **94.12 ms**, p99 **103.54 ms**, max **107.80 ms**, then hit the same teardown segfault after metrics.
- **Result**: keep for the next profiling pass. The p50 gate passed by **~60 ms/token** versus the old **144.00 ms** PPLX p50, and p95 moved from the previous **~160 ms** class to **94.12 ms**. The remaining gap to NCCL p50 **~63 ms** is now about **21 ms/token**; the next evidence should come from a direct-path CUDA+NVTX profile, not the old worker-wait profile.
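
The keep/revert gates in this log all have the same shape: full generation, then a p50 bound, then a p95 bound. A minimal sketch of that check is below. `Gate`, `Verdict`, and `evaluate` are hypothetical names, and the nearest-rank percentile is an assumption; `bench_serving` may compute percentiles differently.

```rust
/// Nearest-rank percentile; `pct` is in (0, 100]. Returns None for no samples.
fn percentile_ms(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: smallest value with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Keep,
    Revert(&'static str),
}

/// A pre-written gate, e.g. 64 tokens, p50 <= 124 ms, p95 <= 165 ms.
struct Gate {
    expected_tokens: usize,
    p50_max_ms: f64,
    p95_max_ms: f64,
}

/// Apply the gate to one run's steady-state per-token samples.
fn evaluate(gate: &Gate, generated_tokens: usize, steady_ms: &[f64]) -> Verdict {
    if generated_tokens != gate.expected_tokens {
        return Verdict::Revert("incomplete generation");
    }
    let (Some(p50), Some(p95)) =
        (percentile_ms(steady_ms, 50.0), percentile_ms(steady_ms, 95.0))
    else {
        return Verdict::Revert("no steady samples");
    };
    if p50 > gate.p50_max_ms {
        return Verdict::Revert("p50 gate missed");
    }
    if p95 > gate.p95_max_ms {
        return Verdict::Revert("p95 gate missed");
    }
    Verdict::Keep
}
```

Writing the gate down before the run is the point of the pattern: the verdict then depends only on the log, not on how promising the mechanism looked.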
Direct routed follow-up tightening:
@@ -848,17 +848,17 @@ Direct active-peer sync attempts:
GPU-only compact grouped attempt:
- **Change**: added compact scratch buffers and two CUDA wrappers to compact padded `expanded_input` into an unpadded layout, run grouped FP4 with host rows `world_size * num_tokens * topk` (**48** for bs=1/EP8/topk6), then scatter `compact_out` back to padded `expert_out` so direct combine could keep its address formula.
-- **Validation**: local `cargo fmt`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, `git diff --check`, H200 release build, and H200 `$RESULT_ROOT/pplx_compact_grouped_olen8.log` all passed. The smoke generated 8/8 tokens with p50 **86.51 ms**.
+- **Validation**: local `cargo fmt`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, `git diff --check`, H200 release build, and H200 `$RESULT_ROOT/pplx_compact_grouped_olen8.log` all passed. The smoke generated 8/8 tokens with p50 **86.51 ms**.
- **Gate result**: H200 `$RESULT_ROOT/pplx_compact_grouped_olen64.log` generated all 64 tokens but steady p50 regressed to **84.00 ms**, p95 **97.38 ms**, max **104.21 ms**. This missed the **<=70 ms** p50 / **<=92 ms** p95 gate and was reverted. The result says the extra compact/scatter launches and fixed API/scheduling cost exceed the benefit of reducing grouped rows from 512 to 48 in this shape.
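
For reference, a CPU-side model of what the compact/scatter pair does. This is a sketch under assumptions, not the CUDA wrappers themselves: rows are reduced to single `u32` values, and `padded_starts` / `counts` are hypothetical stand-ins for the padded `expert_indptr` and `recv_tokens_per_expert` metadata.

```rust
/// Unpadded grouped rows for one decode step: world_size * num_tokens * topk.
fn compact_rows(world_size: usize, num_tokens: usize, topk: usize) -> usize {
    world_size * num_tokens * topk
}

/// Copy the live rows of each padded expert segment into a dense buffer.
/// `padded_starts[e]` is where expert e's segment begins in `padded`;
/// `counts[e]` is how many rows of that segment are live.
fn compact(padded: &[u32], padded_starts: &[usize], counts: &[usize]) -> Vec<u32> {
    let mut dense = Vec::with_capacity(counts.iter().sum::<usize>());
    for (&start, &n) in padded_starts.iter().zip(counts) {
        dense.extend_from_slice(&padded[start..start + n]);
    }
    dense
}

/// Inverse of `compact`: write dense rows back to their padded positions so a
/// consumer that addresses the padded layout keeps working unchanged.
fn scatter(dense: &[u32], padded_starts: &[usize], counts: &[usize], padded_out: &mut [u32]) {
    let mut src = 0;
    for (&start, &n) in padded_starts.iter().zip(counts) {
        padded_out[start..start + n].copy_from_slice(&dense[src..src + n]);
        src += n;
    }
}
```

`compact_rows(8, 1, 6)` reproduces the **48** rows quoted above. The model also makes the cost visible: each direction is an extra pass over the data, and on the GPU an extra launch, which the gate result suggests outweighed the smaller grouped GEMM in this shape.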
Direct combine on compute stream attempt:
- **Change**: left direct dispatch on `moe_stream` so it could still overlap with shared expert, but moved `direct_combine_recv` to `ctx.stream`, removing the direct-path `expert_handoff` and `combine_handoff` event pair.
-- **Validation**: local `cargo fmt`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, `git diff --check`, and H200 release build passed.
+- **Validation**: local `cargo fmt`, local `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving`, `git diff --check`, and H200 release build passed.
- **Gate result**: H200 `$RESULT_ROOT/pplx_direct_combine_ctx_stream_olen64.log` generated all 64 tokens with p50 **77.11 ms**, p95 **90.41 ms**, p99 **92.13 ms**, max **103.39 ms**. This is within rows512 noise and missed the prewritten **<=74 ms** p50 gate, so it was reverted. The result says the two event handoffs after grouped FP4 are not a large p50 owner by themselves.
Rows512 direct clean PPLX vs NCCL profile:
-- **PPLX command/profile**: `/usr/local/cuda-12.9/bin/nsys profile --trace=cuda,nvtx,osrt --sample=none --cuda-event-trace=false --cuda-flush-interval=100 --force-overwrite=true --stats=false -o $RESULT_ROOT/pplx_rows512_narrow_olen16 env PEGAINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 16 --warmup 0 --iters 1`. Artifacts: `$RESULT_ROOT/pplx_rows512_narrow_olen16.log`, `$RESULT_ROOT/pplx_rows512_narrow_olen16.nsys-rep`, `$RESULT_ROOT/pplx_rows512_narrow_olen16.sqlite`. It generated 16/16 tokens with steady avg **80.76 ms**, p50 **79.08 ms**, p95 **86.74 ms**, max **91.02 ms**. Kernel rows survived on all 8 devices: **38,871** per GPU overall and **34,146** per GPU in the steady window.
-- **NCCL comparison**: same command without `PEGAINFER_DSV4_PPLX=1`, artifacts `$RESULT_ROOT/nccl_clean_compare_olen16.log`, `$RESULT_ROOT/nccl_clean_compare_olen16.nsys-rep`, `$RESULT_ROOT/nccl_clean_compare_olen16.sqlite`. It generated 16/16 tokens and hit the known teardown segfault after metrics, with steady avg **63.84 ms**, p50 **63.17 ms**, p95 **66.01 ms**, max **69.82 ms**. The NCCL sqlite lost some per-device kernel rows during teardown, so use its runtime/NVTX rank data for comparison rather than full 8-GPU kernel accounting.
+- **PPLX command/profile**: `/usr/local/cuda-12.9/bin/nsys profile --trace=cuda,nvtx,osrt --sample=none --cuda-event-trace=false --cuda-flush-interval=100 --force-overwrite=true --stats=false -o $RESULT_ROOT/pplx_rows512_narrow_olen16 env OPENINFER_DSV4_PPLX=1 NCCL_NVLS_ENABLE=0 ./target/release/bench_serving --model-path $MODEL_DIR request --prompt-len 1 --output-len 16 --warmup 0 --iters 1`. Artifacts: `$RESULT_ROOT/pplx_rows512_narrow_olen16.log`, `$RESULT_ROOT/pplx_rows512_narrow_olen16.nsys-rep`, `$RESULT_ROOT/pplx_rows512_narrow_olen16.sqlite`. It generated 16/16 tokens with steady avg **80.76 ms**, p50 **79.08 ms**, p95 **86.74 ms**, max **91.02 ms**. Kernel rows survived on all 8 devices: **38,871** per GPU overall and **34,146** per GPU in the steady window.
+- **NCCL comparison**: same command without `OPENINFER_DSV4_PPLX=1`, artifacts `$RESULT_ROOT/nccl_clean_compare_olen16.log`, `$RESULT_ROOT/nccl_clean_compare_olen16.nsys-rep`, `$RESULT_ROOT/nccl_clean_compare_olen16.sqlite`. It generated 16/16 tokens and hit the known teardown segfault after metrics, with steady avg **63.84 ms**, p50 **63.17 ms**, p95 **66.01 ms**, max **69.82 ms**. The NCCL sqlite lost some per-device kernel rows during teardown, so use its runtime/NVTX rank data for comparison rather than full 8-GPU kernel accounting.
- **Rank-lane accounting**: PPLX rank0-like decode p50 **78.334 ms** versus NCCL rank0-like p50 **62.857 ms**, gap **15.477 ms**. On the same rank0-like lane, PPLX launch API p50 **36.307 ms** versus NCCL **27.521 ms** (gap **8.786 ms**), and PPLX final D2H/drain p50 **32.651 ms** versus NCCL **25.964 ms** (gap **6.687 ms**). These two gaps sum to **15.473 ms**, matching the rank0 decode p50 gap. Non-rank0 PPLX decode p50 median is **43.656 ms**; NCCL non-rank0 p50 median is **36.495 ms**.
- **PPLX steady runtime API**: `cudaLaunchKernel_v7000` dominates by total time with **239,568** calls totaling **3913.162 ms**; `cuMemcpyDtoHAsync_v2` has **14** calls totaling **469.559 ms** and is the final queue drain; `cuLaunchKernelEx` totals **76.816 ms** and `cudaLaunchCooperativeKernel_v9000` totals **59.689 ms**. Event waits/records are small at this profile granularity compared with launch and drain.
- **Launch owners**: PPLX launch API time is mostly in HC / GEMV / TileLang / grouped wrappers, not the direct kernels. Top correlated launch totals include `deepseek_hc_bf16_to_f32_kernel` **574.6 ms**, cuBLAS `gemvx::kernel...` **444.3 ms**, `deepseek_hc_scale_mixes_block_kernel` **343.7 ms**, `deepseek_tilelang_fp8_gemm_n4096_k1024_kernel` **272.9 ms**, and `deepseek_hc_pre_norm_from_mixes_kernel` **256.0 ms**. Direct dispatch launch total is **37.0 ms** and direct combine launch total is **22.6 ms** across all rank threads in the steady window.
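
The rank-lane accounting above can be re-derived mechanically. A minimal sketch, with hypothetical type and function names; note that p50 values of separate components do not add in general, so a small residual is a consistency check rather than a proof of attribution:

```rust
/// p50 values (ms) for one rank0-like lane of a profile.
#[derive(Clone, Copy)]
struct LaneP50 {
    decode_ms: f64,
    launch_api_ms: f64,
    drain_ms: f64,
}

/// Component-wise gap between two lanes (PPLX minus NCCL).
fn gap(pplx: LaneP50, nccl: LaneP50) -> LaneP50 {
    LaneP50 {
        decode_ms: pplx.decode_ms - nccl.decode_ms,
        launch_api_ms: pplx.launch_api_ms - nccl.launch_api_ms,
        drain_ms: pplx.drain_ms - nccl.drain_ms,
    }
}

/// Part of the decode gap not covered by the launch-API and drain gaps.
fn unexplained_ms(g: LaneP50) -> f64 {
    g.decode_ms - (g.launch_api_ms + g.drain_ms)
}
```

Feeding in the rank0-like numbers (PPLX 78.334 / 36.307 / 32.651, NCCL 62.857 / 27.521 / 25.964) yields gaps of 15.477, 8.786, and 6.687 ms, with about 0.004 ms left unexplained.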
@@ -905,14 +905,14 @@ Current direct route-position reuse experiment:
- **Hypothesis**: `a2a_direct_dispatch_kernel` already computes the exact `(source_rank, padded position)` for each local token route when writing peer `expanded_input`. `a2a_direct_combine_recv_kernel` recomputes the same base/prefix position from `num_routed` and `token_offset` before reading peer `expert_out`. Persisting dispatch's per-route `position/source_rank` in direct workspace and feeding it to direct combine removes the duplicated metadata pass and shared-memory position staging.
- **Ceiling estimate**: clean rows512 rank0 steady profile shows direct combine kernel body **320.282 ms** and direct dispatch kernel body **158.331 ms** across 14 rank0 steady steps. This change targets direct combine body work only; expected p50 gain is keepable only if the removed position calculation shifts wall-clock by **>=5 ms/token** or materially reduces final drain.
- **Keep/revert criterion**: keep only if local/H200 release builds pass, PPLX `output_len=64` generates all 64 tokens with p50 improved by **>=5 ms/token** and p95 <= **92 ms**. Revert on build failure, CUDA error, hang/timeout, wrong-looking output, p50 movement under the gate, or p95 regression.
-- **Result**: reverted. Local `git diff --check` and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 smoke `$RESULT_ROOT/pplx_routepos_reuse_olen8.log` generated 8/8 tokens with steady p50 **74.41 ms**. Full gates generated all 64 tokens twice: `$RESULT_ROOT/pplx_routepos_reuse_olen64.log` measured first decode **140.74 ms**, steady avg **76.95 ms**, p50 **74.52 ms**, p95 **87.90 ms**, p99 **89.30 ms**, max **98.87 ms**; `$RESULT_ROOT/pplx_routepos_reuse_olen64_r2.log` measured first decode **129.11 ms**, steady avg **78.23 ms**, p50 **74.54 ms**, p95 **89.30 ms**, p99 **89.74 ms**, max **89.82 ms**. Teardown hit the known status 139 segfault after metrics. The result is a repeatable 2.8-4.6 ms p50 improvement and healthier p95, but it misses the prewritten **>=5 ms/token** p50 gate, so the code was removed and H200 was rebuilt after revert. Lesson: duplicated direct-combine position calculation is real but too small alone; the next retained change has to merge a larger direct-side stage or reduce launch/queue depth.
+- **Result**: reverted. Local `git diff --check` and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 smoke `$RESULT_ROOT/pplx_routepos_reuse_olen8.log` generated 8/8 tokens with steady p50 **74.41 ms**. Full gates generated all 64 tokens twice: `$RESULT_ROOT/pplx_routepos_reuse_olen64.log` measured first decode **140.74 ms**, steady avg **76.95 ms**, p50 **74.52 ms**, p95 **87.90 ms**, p99 **89.30 ms**, max **98.87 ms**; `$RESULT_ROOT/pplx_routepos_reuse_olen64_r2.log` measured first decode **129.11 ms**, steady avg **78.23 ms**, p50 **74.54 ms**, p95 **89.30 ms**, p99 **89.74 ms**, max **89.82 ms**. Teardown hit the known status 139 segfault after metrics. The result is a repeatable 2.8-4.6 ms p50 improvement and healthier p95, but it misses the prewritten **>=5 ms/token** p50 gate, so the code was removed and H200 was rebuilt after revert. Lesson: duplicated direct-combine position calculation is real but too small alone; the next retained change has to merge a larger direct-side stage or reduce launch/queue depth.
Current reusable PPLX handoff events experiment:
- **Target metric**: H200 rows512 PPLX `output_len=64` steady p50 should improve by at least **5 ms/token** from the retained **77-79 ms** baseline, p95 should stay <= **92 ms**, and all 64 tokens must be generated. NCCL path is untouched; rebuild validation is enough unless shared code moves.
- **Hypothesis**: `moe_pplx.rs` creates a fresh CUDA event for each explicit stream handoff (`route`, `direct_dispatch`, `indptr`, `expert`, `combine`) via `CudaStream::record_event(Some(DISABLE_TIMING))`. The clean rows512 decode window shows `cuEventCreate` **20640** calls / **48.418 ms** and `cuEventDestroy` **20640** calls / **10.168 ms**, while `cudaEventRecord/cuEventRecord/cuStreamWaitEvent` remain separate. Preallocating the handoff events in `MoePplxScratch` and re-recording them each layer should remove event create/destroy fanout without changing stream ordering.
- **Ceiling estimate**: The measured create/destroy total is about **58.6 ms** over the profile's decode window across all rank threads. The keep gate still requires **>=5 ms/token** p50 because API totals across ranks do not directly translate to request p50; this is only worthwhile if event allocation contributes to the launch/queue tail.
- **Keep/revert criterion**: keep only if local/H200 release builds pass, PPLX `output_len=64` generates all 64 tokens with p50 improved by **>=5 ms/token** and p95 <= **92 ms**. Revert on build failure, CUDA event/stream error, hang/timeout, wrong-looking output, insufficient p50 movement, or p95 regression.
-- **Result**: reverted. Local `git diff --check` and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p pegainfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 smoke `$RESULT_ROOT/pplx_reuse_events_olen8.log` generated 8/8 tokens with steady p50 **77.55 ms**. Full gate `$RESULT_ROOT/pplx_reuse_events_olen64.log` generated 64/64 tokens with first decode **149.52 ms**, steady avg **80.20 ms**, p50 **77.09 ms**, p95 **90.45 ms**, p99 **91.70 ms**, max **91.84 ms**; teardown hit the known status 139 segfault after metrics. The p50 stayed in the retained rows512 baseline band, so removing event create/destroy alone is not enough. Lesson: the event allocation calls show up in API totals, but the request p50 is governed by queued work and waits that remain after event reuse.
+- **Result**: reverted. Local `git diff --check` and `PATH=/usr/local/cuda/bin:$PATH cargo check --release -p openinfer-server --features deepseek-v4,pplx-ep --bin bench_serving` passed; H200 release build passed. H200 smoke `$RESULT_ROOT/pplx_reuse_events_olen8.log` generated 8/8 tokens with steady p50 **77.55 ms**. Full gate `$RESULT_ROOT/pplx_reuse_events_olen64.log` generated 64/64 tokens with first decode **149.52 ms**, steady avg **80.20 ms**, p50 **77.09 ms**, p95 **90.45 ms**, p99 **91.70 ms**, max **91.84 ms**; teardown hit the known status 139 segfault after metrics. The p50 stayed in the retained rows512 baseline band, so removing event create/destroy alone is not enough. Lesson: the event allocation calls show up in API totals, but the request p50 is governed by queued work and waits that remain after event reuse.
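
A call-count model of the two event strategies, to show what the experiment removed and what it left in place. This is a sketch only: it does not touch CUDA, `EventApiCalls` and both functions are hypothetical, and it assumes every layer records all five named handoffs, which the real direct path may not do.

```rust
/// Stream handoffs named in the hypothesis above.
const HANDOFFS: [&str; 5] = ["route", "direct_dispatch", "indptr", "expert", "combine"];

/// Counts of event-related driver calls issued by one rank thread.
#[derive(Debug, Default, PartialEq)]
struct EventApiCalls {
    create: u64,
    record: u64,
    destroy: u64,
}

/// Original behaviour: a fresh event is created, recorded, and dropped for
/// every handoff in every layer of every decode step.
fn fresh_event_per_handoff(steps: u64, layers: u64) -> EventApiCalls {
    let mut calls = EventApiCalls::default();
    for _ in 0..steps * layers {
        for _ in HANDOFFS {
            calls.create += 1;
            calls.record += 1;
            calls.destroy += 1;
        }
    }
    calls
}

/// Experiment: events are allocated once in scratch and re-recorded per layer.
fn preallocated_events(steps: u64, layers: u64) -> EventApiCalls {
    let n = HANDOFFS.len() as u64;
    let mut calls = EventApiCalls { create: n, record: 0, destroy: n };
    for _ in 0..steps * layers {
        for _ in HANDOFFS {
            calls.record += 1;
        }
    }
    calls
}
```

Create/destroy counts collapse to a constant, but the record count (and the matching stream waits) is identical in both strategies, which is consistent with the unchanged p50.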
Current direct combine ctx-stream plus route-position experiment:
- **Target metric**: H200 rows512 PPLX `output_len=64` steady p50 should improve by at least **5 ms/token** from the retained **77-79 ms** baseline, p95 should stay <= **92 ms**, and all 64 tokens must be generated. NCCL path is untouched; rebuild validation is enough unless shared runtime code moves.
@@ -931,7 +931,7 @@ Current direct combine ctx-stream plus route-position experiment:
| Hadamard+FP4 single-pair fusion closes the ratio4 tail | killed | Fused serial kernel built and completed H200 smoke, but serving p95 regressed **184.00 -> 216.01 ms** while p50 stayed **144.03 ms**. | Re-open only for a larger fused ratio4 boundary operator that removes several launches and shows request-level p50/p95 improvement over >=16 decode steps. |
| Fine-grained local CUDA Graph islands can close the non-communication gap | killed as implemented | H200 `output_len=24` graph-island run completed but steady p50 regressed to **154.78 ms**. Decode-only profile shows 1056 graph instantiates totaling **944.08 ms**, **23232** graph launches totaling **421.25 ms**, launch-class calls/step down only about **8%**, and event wait/record counts essentially unchanged. | Re-open only for a larger static island that removes explicit stream handoff boundaries and graph-launch fanout, with a pre-written gate of at least 15 ms p50 improvement over >=16 decode steps. |
| Rank/a2a worker CPU overlap drives p95/max tail | alive, mitigation kept | H200 clean log had exact CPU overlap on CPU **6**; separating rank worker and pplx a2a worker pools cut `output_len=64` p95 **216.01 -> 159.96/162.86 ms** and max **300.01 -> 164.00/168.01 ms** across two runs, with p50 unchanged at **144.00 ms**. | Keep the pool split. Do not claim p50 progress from it; next p50 work needs a different mechanism. |
-| Rank0 TE/fabric worker on CPU0 drives the legacy p50 floor | alive, mitigation kept | `/proc` decode-window sampling before the fix showed CPU0 `tx_engine_domain` pinned correctly but only got **3602 ms** runtime over a **7012 ms** sample and took **2980** nonvoluntary switches, while fabric workers on CPU24/48/72 got ~**7008-7012 ms** runtime and **9-17** nonvoluntary switches. Moving rank0 TE off CPU0 produced two H200 `output_len=64` runs with steady p50 **66.46 / 66.70 ms**, p95 **69.80 / 69.62 ms**, max **71.48 / 71.89 ms**. The corrected `/proc` sample saw CPU10 `tx_engine_domain` runtime delta **3452 ms** and only **4** nonvoluntary switches. Logs: `$RESULT_ROOT/pplx_te_repin_olen64.log`, `$RESULT_ROOT/pplx_te_repin_olen64_r2.log`, `$RESULT_ROOT/pplx_te_repin_olen64_r2_proc_summary.txt`. Cleanup first introduced a per-rank placement plan, then a later review found topology-group role selection could collide with rank workers (`rank0 a2a/TE/UVM` on CPUs already used by rank1/2/3 workers). Current code moves the generic pieces to `pegainfer_core::cpu_topology`: read CUDA device NUMA, current affinity, and NUMA cpulist; split each NUMA pool into contiguous rank slices; reserve CPU0 for the system and CPU1 for scheduler; assign rank/a2a/TE/UVM roles from that rank's own slice; log `cpu_slice/rank_worker/TE/a2a/UVM` per rank at startup. H200 per-NUMA slice validation showed no CPU collision and measured `output_len=64` steady p50 **66.65 ms**, p95 **68.15 ms**, max **69.47 ms** before the known teardown segfault. | Keep CPU0/CPU1 reservation and per-NUMA rank slices. Validate future placement changes with startup logs, TPOT, and `/proc//sched` deltas; do not rely on topology-group CPU order alone. |
+| Rank0 TE/fabric worker on CPU0 drives the legacy p50 floor | alive, mitigation kept | `/proc` decode-window sampling before the fix showed CPU0 `tx_engine_domain` pinned correctly but only got **3602 ms** runtime over a **7012 ms** sample and took **2980** nonvoluntary switches, while fabric workers on CPU24/48/72 got ~**7008-7012 ms** runtime and **9-17** nonvoluntary switches. Moving rank0 TE off CPU0 produced two H200 `output_len=64` runs with steady p50 **66.46 / 66.70 ms**, p95 **69.80 / 69.62 ms**, max **71.48 / 71.89 ms**. The corrected `/proc` sample saw CPU10 `tx_engine_domain` runtime delta **3452 ms** and only **4** nonvoluntary switches. Logs: `$RESULT_ROOT/pplx_te_repin_olen64.log`, `$RESULT_ROOT/pplx_te_repin_olen64_r2.log`, `$RESULT_ROOT/pplx_te_repin_olen64_r2_proc_summary.txt`. Cleanup first introduced a per-rank placement plan, then a later review found topology-group role selection could collide with rank workers (`rank0 a2a/TE/UVM` on CPUs already used by rank1/2/3 workers). Current code moves the generic pieces to `openinfer_core::cpu_topology`: read CUDA device NUMA, current affinity, and NUMA cpulist; split each NUMA pool into contiguous rank slices; reserve CPU0 for the system and CPU1 for scheduler; assign rank/a2a/TE/UVM roles from that rank's own slice; log `cpu_slice/rank_worker/TE/a2a/UVM` per rank at startup. H200 per-NUMA slice validation showed no CPU collision and measured `output_len=64` steady p50 **66.65 ms**, p95 **68.15 ms**, max **69.47 ms** before the known teardown segfault. | Keep CPU0/CPU1 reservation and per-NUMA rank slices. Validate future placement changes with startup logs, TPOT, and `/proc//sched` deltas; do not rely on topology-group CPU order alone. |
| Route all-gather alone explains the 144 ms p50 floor | killed | Intra-process route exchange skipped `route_write_op + route_counter.wait` through a process-local barrier and direct peer `num_routed` pointer reads. H200 `output_len=64` completed all tokens but measured p50 **144.00 ms**, p95 **155.95 ms**, matching the CPU-pool p50 baseline. | Do not spend more patches on isolated route exchange. Re-open only as part of a full single-node state-machine replacement that reduces per-layer a2a union p50 by at least 20 ms. |
| Grouped FP4 unused capacity explains the 144 ms p50 floor | killed | bs=1 capacity clamp reduced theoretical pplx grouped row bound from **1376** to **560** rows and completed H200 `output_len=64`, but p50 remained **144.00 ms** with p95 **155.74 ms**. | Do not pursue host-side capacity clamps as a p50 fix. Re-open only with a profile proving grouped FP4 kernels, not a2a/driver synchronization, own at least 15 ms p50. |
| libcuda driver lock/contention explains non-communication operator spikes | alive as tail source, not p50 owner | Decode-only `--cudabacktrace=all:1000` + OSRT profile shows rank-thread `pthread_mutex_lock` total **782.16 ms** and max **108.95 ms**. The largest mutex stacks are `libcuda -> cuModuleGetFunction/cuLibraryGetModule -> cudaLaunchCooperativeKernel -> a2a_dispatch_send`, while non-communication `cudaLaunchKernel` tails show 16-30 ms API time for microsecond-scale kernels inside ratio4/HC ranges. NVTX-only `output_len=64` shows attention-local medians are sub-ms and do not explain the **144 ms** p50 floor. | Keep using API-vs-kernel profiles for tails. Do not spend more p50 experiments on single tiny operator launches unless a step-correlated profile shows at least 15 ms p50 ownership. |
@@ -978,9 +978,9 @@ Current direct combine ctx-stream plus route-position experiment:
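- In the same window, the other devices are often still running `a2a_dispatch_recv`, `a2a_combine_send`, and `a2a_combine_recv`. This explains why direct-worker-mode only moved the wait from `combine_recv_done` to `combine_send_done`: the worker is still a serializing observer of each layer's local MoE readiness.
- The single-node direct routed path lowered p50 from **144.00 ms** to **83.94 ms**, and rows512 lowered it further to **77-79 ms**. This shows that most of the old floor came from the legacy worker/cooperative-kernel cadence, but the path bypassed the upstream four-stage semantics and has been removed from the code.
4. **The retained implementation returns to legacy four-stage**
- - The experimental value of the direct routed path was locating the mechanism, not keeping the bypass in `pegainfer-comm`. What is retained now is the legacy `dispatch_send -> dispatch_recv -> combine_send -> combine_recv` path, plus the per-NUMA rank-slice placement fix.
+ - The experimental value of the direct routed path was locating the mechanism, not keeping the bypass in `openinfer-comm`. What is retained now is the legacy `dispatch_send -> dispatch_recv -> combine_send -> combine_recv` path, plus the per-NUMA rank-slice placement fix.
- After the CPU placement fix, two H200 `output_len=64` reruns measured steady p50 **66.46 / 66.70 ms** and p95 **69.80 / 69.62 ms**, already close to the NCCL **63 ms** level. The remaining work should start with a low-intrusion profile of the new baseline rather than continued maintenance of the direct hack.
- - The new placement uses `pegainfer_core::cpu_topology`, which extracts common CPU list parsing, affinity masks, thread pinning, and CUDA-device NUMA lookup from the dsv4-private helper. The local gate covers the `cpu_topology` unit tests, Rust/CUDA bridge compilation, formatting, and diagnostic bench feature compilation; the next H200 profile should be based on this cleaned legacy path.
+ - The new placement uses `openinfer_core::cpu_topology`, which extracts common CPU list parsing, affinity masks, thread pinning, and CUDA-device NUMA lookup from the dsv4-private helper. The local gate covers the `cpu_topology` unit tests, Rust/CUDA bridge compilation, formatting, and diagnostic bench feature compilation; the next H200 profile should be based on this cleaned legacy path.
5. **Tighten the grouped GEMM host rows upper bound, but only via a GPU-only path**
- The old GPU indptr version used `expanded_input.seq_capacity()` as the grouped GEMM `rows`, so it ran extra act-quant / epilogue work on empty rows. After direct routed, this waste became visible again; rows512 measured a **5-7 ms/token** p50 gain.
- A per-layer D2H copy of one padded-total scalar measured TPOT degrading to 1.54s/token; a standalone compact/scatter also measured a regression to p50 **84.00 ms**. Going further requires either letting the grouped path natively accept a sparse/padded indptr or fusing compact into an existing kernel; adding two new per-layer launches is not an option.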
diff --git a/docs/models/deepseek-v4/prefix-paged-kv-pd-handoff.md b/docs/models/deepseek-v4/prefix-paged-kv-pd-handoff.md
index 09fbd73e..d5cb87ac 100644
--- a/docs/models/deepseek-v4/prefix-paged-kv-pd-handoff.md
+++ b/docs/models/deepseek-v4/prefix-paged-kv-pd-handoff.md
@@ -31,7 +31,7 @@ them.
### Direct KV Ownership
-`code-fact`: `pegainfer-deepseek-v4/src/direct/scheduler.rs` owns
+`code-fact`: `openinfer-deepseek-v4/src/direct/scheduler.rs` owns
`DirectKvCacheManager` and `DirectKvCacheLease`.
Current lifecycle:
@@ -78,7 +78,7 @@ P-D handoff, or transport-level handles.
### Communication Boundary
-`code-fact`: `pegainfer-comm` currently provides EP all-to-all public surface and
+`code-fact`: `openinfer-comm` currently provides EP all-to-all public surface and
opaque operation handles. It does not yet provide KV transfer or ownership
handoff primitives.
@@ -375,7 +375,7 @@ field names.
`derivation`: P-D handoff requires ownership handles, not transport objects. The
handle describes who owns cleanup and which observable signal proves transfer
completion or cancellation. It does not choose RDMA, IPC, serialization, or a
-specific `pegainfer-comm` operation.
+specific `openinfer-comm` operation.
### Export Side
@@ -520,7 +520,7 @@ Out of scope:
- prefix eviction performance tuning;
- production prefix cache policy;
- changing HTTP benchmark semantics;
-- replacing `pegainfer-comm` or adding KV transfer to it.
+- replacing `openinfer-comm` or adding KV transfer to it.
Merge criteria:
@@ -536,5 +536,5 @@ Merge criteria:
deliberately does not commit to a value.
- Decide whether prefix entries are rank-local only in v1 or require a
multi-rank consistency object.
-- Define a future `pegainfer-comm` KV-transfer extension only after allocator
+- Define a future `openinfer-comm` KV-transfer extension only after allocator
handles and cleanup semantics are proven locally.
diff --git a/docs/models/deepseek-v4/serving-baseline.md b/docs/models/deepseek-v4/serving-baseline.md
index 3255d70c..511af13c 100644
--- a/docs/models/deepseek-v4/serving-baseline.md
+++ b/docs/models/deepseek-v4/serving-baseline.md
@@ -19,7 +19,7 @@ Use this document as the baseline contract before changing the DeepSeek V4 sched
| Capability | Status | Evidence |
| --- | --- | --- |
-| DeepSeek V4 engine load behind the OpenAI HTTP facade | Available for smoke testing | `pegainfer-server --features deepseek-v4 --bin pegainfer` starts an OpenAI server for `$MODEL_DIR` on 8x RTX 5090 |
+| DeepSeek V4 engine load behind the OpenAI HTTP facade | Available for smoke testing | `openinfer-server --features deepseek-v4 --bin openinfer` starts an OpenAI server for `$MODEL_DIR` on 8x RTX 5090 |
| `/v1/models` | Available | The returned model id is the full model path: `$MODEL_DIR` |
| `/v1/completions` single-request greedy smoke | Available | Prompt `hello`, `max_tokens=4`, `temperature=0` returned a text completion and usage accounting |
| Direct single-request TPOT/hash regression | Available | `bench_serving request --prompt-len 1 --output-len 160 --warmup 2 --iters 3 --seed 42` is the retained DeepSeek V4 decode gate |
@@ -34,21 +34,21 @@ Run these commands from any checkout at or after PR #101's merge commit `d6d2cee
Build the HTTP server on the 5090 host:
```bash
-cd /path/to/pegainfer
+cd /path/to/openinfer
export PATH=/usr/local/cuda-13.1/bin:$PWD/.venv/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.1
-export PEGAINFER_TILELANG_PYTHON=$PWD/.venv/bin/python
-export PEGAINFER_TRITON_PYTHON=$PWD/.venv/bin/python
-export PEGAINFER_NVCC_JOBS=8
-export CARGO_TARGET_DIR=/path/to/pegainfer-target
+export OPENINFER_TILELANG_PYTHON=$PWD/.venv/bin/python
+export OPENINFER_TRITON_PYTHON=$PWD/.venv/bin/python
+export OPENINFER_NVCC_JOBS=8
+export CARGO_TARGET_DIR=/path/to/openinfer-target
-cargo build --release -p pegainfer-server --features deepseek-v4 --bin pegainfer
+cargo build --release -p openinfer-server --features deepseek-v4 --bin openinfer
```
Start the HTTP endpoint:
```bash
-$CARGO_TARGET_DIR/release/pegainfer \
+$CARGO_TARGET_DIR/release/openinfer \
--model-path $MODEL_DIR \
--port 18103
```
@@ -102,7 +102,7 @@ Observed smoke result:
Run the direct single-request decode regression gate:
```bash
-cargo run --release -p pegainfer-server \
+cargo run --release -p openinfer-server \
--bin bench_serving \
--features deepseek-v4 \
-- \
diff --git a/docs/models/deepseek-v4/support.md b/docs/models/deepseek-v4/support.md
index e924ec99..b490bbb7 100644
--- a/docs/models/deepseek-v4/support.md
+++ b/docs/models/deepseek-v4/support.md
@@ -9,8 +9,8 @@ This document is the single project record for the initial DeepSeek V4 PR. It re
The PR scope is:
-- add `pegainfer-deepseek-v4` as the model crate for the DeepSeek V4 Flash MP8 checkpoint;
-- wire DeepSeek V4 into `pegainfer-server` model detection and `bench_serving`;
+- add `openinfer-deepseek-v4` as the model crate for the DeepSeek V4 Flash MP8 checkpoint;
+- wire DeepSeek V4 into `openinfer-server` model detection and `bench_serving`;
- build official-style DeepSeek V4 TileLang kernels at compile time;
- keep runtime Python-free;
- provide exact text, operator, and HTTP service validation;
@@ -23,7 +23,7 @@ DeepSeek V4 currently requires the `deepseek-v4` Cargo feature and TileLang at b
The kernels build script probes:
-- `PEGAINFER_TILELANG_PYTHON`, if set;
+- `OPENINFER_TILELANG_PYTHON`, if set;
- `../.venv/bin/python`;
- `.venv/bin/python`;
- `python3`;
@@ -40,28 +40,28 @@ Minimal setup:
uv venv && source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu128
uv pip install "tilelang==0.1.9"
-export PEGAINFER_TILELANG_PYTHON=.venv/bin/python
+export OPENINFER_TILELANG_PYTHON=.venv/bin/python
```
-The generated CUDA is linked into `pegainfer-kernels` when the feature is enabled; Python is not needed at runtime.
+The generated CUDA is linked into `openinfer-kernels` when the feature is enabled; Python is not needed at runtime.
## Implementation Summary
### Model Crate
-`pegainfer-deepseek-v4` owns:
+`openinfer-deepseek-v4` owns:
- config parsing for DeepSeek V4 MP8;
- per-rank weight manifests and GPU loading;
- runtime ops for block prefill/decode, HC, sparse attention, routing, compressor state, and final logits;
- direct `EngineHandle` integration used by server and tests;
-- exact E2E tests driven by `test_data/deepseek-v4-ground-truth.json`, with `PEGAINFER_DEEPSEEK_GT_PATH` available for regenerations.
+- exact E2E tests driven by `test_data/deepseek-v4-ground-truth.json`, with `OPENINFER_DEEPSEEK_GT_PATH` available for regenerations.
The direct engine seeds decode cache from prompt prefill instead of replaying prompt tokens through decode. This made exact validation practical enough for PR use.
### TileLang Kernels
-`pegainfer-kernels/tools/tilelang/deepseek_v4/generate.py` generates CUDA sources for official-style DeepSeek V4 kernels:
+`openinfer-kernels/tools/tilelang/deepseek_v4/generate.py` generates CUDA sources for official-style DeepSeek V4 kernels:
- `act_quant_kernel`
- `fp8_gemm_kernel`
@@ -104,11 +104,11 @@ All 20 ground-truth cases pass exact text validation as four 5-case slices with
Command shape:
```bash
-PEGAINFER_DEEPSEEK_GT_OFFSET= \
-PEGAINFER_DEEPSEEK_GT_LIMIT=5 \
-PEGAINFER_DEEPSEEK_GT_MAX_NEW_TOKENS=64 \
-PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash \
-cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test e2e -- --nocapture --exact test_e2e_deepseek_v4_generation
+OPENINFER_DEEPSEEK_GT_OFFSET= \
+OPENINFER_DEEPSEEK_GT_LIMIT=5 \
+OPENINFER_DEEPSEEK_GT_MAX_NEW_TOKENS=64 \
+OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash \
+cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test e2e -- --nocapture --exact test_e2e_deepseek_v4_generation
```
### Operator Guards
@@ -116,9 +116,9 @@ cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test e2e
The full DeepSeek V4 `mp8_manifest` release test passes:
```bash
-PEGAINFER_TEST_MODEL_PATH=$MODEL_DIR \
-PEGAINFER_NVCC_JOBS=8 \
-cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest -- --nocapture
+OPENINFER_TEST_MODEL_PATH=$MODEL_DIR \
+OPENINFER_NVCC_JOBS=8 \
+cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest -- --nocapture
```
Result: `23 passed`, `0 failed`.
@@ -127,14 +127,14 @@ Coverage includes MP8 layout accessors, RoPE formula checks, TileLang FP8/FP4 li
### HTTP Service
-With `--features deepseek-v4`, `pegainfer-server` detects `model_type="deepseek_v4"` and starts DeepSeek V4 with eight devices and CUDA graph disabled.
+With `--features deepseek-v4`, `openinfer-server` detects `model_type="deepseek_v4"` and starts DeepSeek V4 with eight devices and CUDA graph disabled.
The initial service path is intentionally greedy-only. Requests that ask for sampling or logprobs are rejected before generation and surfaced through `stop_reason` instead of being silently coerced to greedy. This is a temporary compatibility choice in the vLLM frontend path; a later API cleanup should reject unsupported DeepSeek V4 request parameters during request validation instead of representing them as a completed generation.
Server command used for HTTP validation:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --features deepseek-v4 -- \
--model-path $MODEL_DIR --port 18080
```
@@ -197,7 +197,7 @@ Earlier exact-request profiling removed the large synchronous `cudaMalloc/cudaFr
The current synthetic decode-heavy baseline on 5090-dev is:
```bash
-PEGAINFER_NVCC_JOBS=8 cargo run --release -p pegainfer-server --bin bench_serving --features deepseek-v4 -- \
+OPENINFER_NVCC_JOBS=8 cargo run --release -p openinfer-server --bin bench_serving --features deepseek-v4 -- \
--model-path $MODEL_DIR --format json \
request --prompt-len 1 --output-len 32 --warmup 1 --iters 1
```
@@ -219,12 +219,12 @@ The remaining TPOT problem is structural: decode still launches hundreds of thou
## Workspace Isolation
-DeepSeek V4 is a workspace member, but its DeepSeek-specific bins, integration tests, and `pegainfer-kernels/deepseek-v4` dependency are gated behind the `deepseek-v4` feature. This keeps default Qwen-oriented workspace checks from requiring TileLang.
+DeepSeek V4 is a workspace member, but its DeepSeek-specific bins, integration tests, and `openinfer-kernels/deepseek-v4` dependency are gated behind the `deepseek-v4` feature. This keeps default Qwen-oriented workspace checks from requiring TileLang.
Verified:
-- `PEGAINFER_NVCC_JOBS=8 cargo check --release --workspace` passed with DeepSeek TileLang disabled in `pegainfer-kernels`.
-- `cargo test --release --workspace --lib` passed with DeepSeek TileLang disabled in `pegainfer-kernels`. Qwen model-loading lib tests now skip only when `PEGAINFER_TEST_MODEL_PATH` is unset and the default local model directory is absent; explicitly provided model paths still run normally and fail normally if invalid.
+- `OPENINFER_NVCC_JOBS=8 cargo check --release --workspace` passed with DeepSeek TileLang disabled in `openinfer-kernels`.
+- `cargo test --release --workspace --lib` passed with DeepSeek TileLang disabled in `openinfer-kernels`. Qwen model-loading lib tests now skip only when `OPENINFER_TEST_MODEL_PATH` is unset and the default local model directory is absent; explicitly provided model paths still run normally and fail normally if invalid.
## Known Follow-ups
@@ -235,7 +235,7 @@ These are intentionally out of the initial PR scope:
- add arbitrary-value per-shape TileLang FP8/FP4 parity tests beyond the current power-of-two guards;
- profile final logits only if nsys shows it matters;
- profile NCCL all-reduce and TileLang FP4 GEMM after the initial PR lands;
-- narrow the current public diagnostic surface in `pegainfer-deepseek-v4` after bring-up bins/tests are either retired or moved behind a dedicated test-helper boundary;
+- narrow the current public diagnostic surface in `openinfer-deepseek-v4` after bring-up bins/tests are either retired or moved behind a dedicated test-helper boundary;
- move unsupported DeepSeek V4 request handling from generation-time `stop_reason` compatibility into frontend request validation;
- add an explicit non-panicking shutdown path for NCCL communicator teardown.
@@ -243,21 +243,21 @@ These are intentionally out of the initial PR scope:
Before opening the PR, keep the required gate focused:
-- `cargo fmt --check -p pegainfer-deepseek-v4`
-- `cargo check --release -p pegainfer-server`
-- `PEGAINFER_NVCC_JOBS=8 cargo check --release -p pegainfer-server --features deepseek-v4`
-- `PEGAINFER_TEST_MODEL_PATH=$MODEL_DIR PEGAINFER_NVCC_JOBS=8 cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest -- --nocapture`
+- `cargo fmt --check -p openinfer-deepseek-v4`
+- `cargo check --release -p openinfer-server`
+- `OPENINFER_NVCC_JOBS=8 cargo check --release -p openinfer-server --features deepseek-v4`
+- `OPENINFER_TEST_MODEL_PATH=$MODEL_DIR OPENINFER_NVCC_JOBS=8 cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest -- --nocapture`
- four exact E2E slices over `test_data/deepseek-v4-ground-truth.json`, using:
```bash
-PEGAINFER_TEST_MODEL_PATH=$MODEL_DIR \
-PEGAINFER_DEEPSEEK_GT_OFFSET=<0|5|10|15> \
-PEGAINFER_DEEPSEEK_GT_LIMIT=5 \
-PEGAINFER_DEEPSEEK_GT_MAX_NEW_TOKENS=64 \
-PEGAINFER_NVCC_JOBS=8 \
-cargo test --release -p pegainfer-deepseek-v4 --features deepseek-v4 --test e2e -- --nocapture --exact test_e2e_deepseek_v4_generation
+OPENINFER_TEST_MODEL_PATH=$MODEL_DIR \
+OPENINFER_DEEPSEEK_GT_OFFSET=<0|5|10|15> \
+OPENINFER_DEEPSEEK_GT_LIMIT=5 \
+OPENINFER_DEEPSEEK_GT_MAX_NEW_TOKENS=64 \
+OPENINFER_NVCC_JOBS=8 \
+cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test e2e -- --nocapture --exact test_e2e_deepseek_v4_generation
```
-- one `/v1/completions` or `/v1/chat/completions` validation through `pegainfer-server`, plus one unsupported-parameter request checking `stop_reason`.
+- one `/v1/completions` or `/v1/chat/completions` validation through `openinfer-server`, plus one unsupported-parameter request checking `stop_reason`.
Broader workspace checks are valuable, but failures outside the DeepSeek V4 diff should be separated from this initial support PR.
diff --git a/docs/models/kimi-k2/accuracy-gate.md b/docs/models/kimi-k2/accuracy-gate.md
index ec25fdf4..9fd63164 100644
--- a/docs/models/kimi-k2/accuracy-gate.md
+++ b/docs/models/kimi-k2/accuracy-gate.md
@@ -1,6 +1,6 @@
# Kimi-K2 accuracy gate (vLLM-golden)
-**TL;DR**: `pegainfer-kimi-k2/tests/vllm_golden_gate.rs` + `test_data/kimi-k2.6-vllm-golden.safetensors` give Kimi-K2 its first accuracy gate reproducible from a fresh clone (#223). Reference is vLLM (same INT4 quantized model, marlin kernels), not HF. Two passes through the public serving path: teacher-forced argmax sweep (prefill numerics, regret rule + two-sided |Δlogprob| bound) and free-greedy decode parity (decode kernels, divergence-classified). The TP1/DP8 path emits exact per-token logprobs (#236), so the gate measures both engines' logprobs of the same token, like the Qwen gates. Needs 8 GPUs + K2.6 weights; fails loudly when prerequisites are missing.
+**TL;DR**: `openinfer-kimi-k2/tests/vllm_golden_gate.rs` + `test_data/kimi-k2.6-vllm-golden.safetensors` give Kimi-K2 its first accuracy gate reproducible from a fresh clone (#223). Reference is vLLM (same INT4 quantized model, marlin kernels), not HF. Two passes through the public serving path: teacher-forced argmax sweep (prefill numerics, regret rule + two-sided |Δlogprob| bound) and free-greedy decode parity (decode kernels, divergence-classified). The TP1/DP8 path emits exact per-token logprobs (#236), so the gate measures both engines' logprobs of the same token, like the Qwen gates. Needs 8 GPUs + K2.6 weights; fails loudly when prerequisites are missing.
Last touched: 2026-06
@@ -9,7 +9,7 @@ Last touched: 2026-06
Kimi-K2.6 is INT4 (compressed-tensors, pack-quantized). The general methodology
(`docs/subsystems/correctness/logits-golden-gate.md`) uses HF bf16 as golden —
for Kimi that is the wrong precision regime: HF decompresses INT4 to bf16 and
-runs dense GEMMs, while both vLLM and pegainfer execute the quantized model
+runs dense GEMMs, while both vLLM and openinfer execute the quantized model
through marlin-style INT4 kernels. vLLM is the closest equal-precision
reference, and the same box that runs the gate can regenerate the fixture
(vLLM 0.22.0 serves K2.6 out of the box).
@@ -27,7 +27,7 @@ asserts through the *real serving path* (DP coordinator → PPLX EP → MLA
kernels, TP1/DP8/EP8):
1. **Teacher-forced argmax sweep** (prefill numerics): for every tail position
- `i`, prefill `prompt + vllm_tail[..i]` with `max_tokens=1`. pegainfer's
+ `i`, prefill `prompt + vllm_tail[..i]` with `max_tokens=1`. openinfer's
pick must satisfy the flatness-scaled regret rule (see Tolerances): the
allowed distance below vLLM's argmax *in vLLM's own logprobs* grows with
vLLM's own uncertainty at that position. An aggregate exact-match floor
@@ -43,14 +43,14 @@ kernels, TP1/DP8/EP8):
bit-identical).
Both passes additionally bound the **two-sided |Δlogprob|** at exact-match
-positions — pegainfer's own logprob of the agreed token against vLLM's
+positions — openinfer's own logprob of the agreed token against vLLM's
stored one (mean + p99 per pass). Flip positions are excluded from that
population on purpose: their Δ is structurally larger (the engines disagree
about a flat distribution, which the regret rule already governs), and
mixing the populations parks the p99 on the boundary between them — the
same run-to-run straddling that killed fixed regret thresholds. Flip-pick
Δ is printed for observability. A per-position internal-consistency check
-(the pick's logprob must equal the head of pegainfer's own top-K) catches
+(the pick's logprob must equal the head of openinfer's own top-K) catches
GPU-argmax-vs-host-log-softmax disagreement on the same logits.
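
The split between the exact-match and flip populations can be sketched as follows. This is a minimal illustration, not the gate's actual code: the `Position` struct, its field names, the `delta_stats` helper, and the nearest-rank p99 definition are all assumptions made for this example.

```rust
/// One compared position (hypothetical layout, for illustration only).
struct Position {
    ours_token: u32,
    vllm_token: u32,
    ours_logprob: f64,
    vllm_logprob: f64,
}

/// Mean and nearest-rank p99 of |Δlogprob| over exact-match positions only.
/// Returns None when no position matches exactly.
fn delta_stats(positions: &[Position]) -> Option<(f64, f64)> {
    let mut deltas: Vec<f64> = positions
        .iter()
        .filter(|p| p.ours_token == p.vllm_token) // flip positions are excluded on purpose
        .map(|p| (p.ours_logprob - p.vllm_logprob).abs())
        .collect();
    if deltas.is_empty() {
        return None;
    }
    deltas.sort_by(|a, b| a.total_cmp(b));
    let mean = deltas.iter().sum::<f64>() / deltas.len() as f64;
    // Nearest-rank percentile (assumed definition): ceil(0.99 * n), 1-indexed.
    let rank = ((0.99 * deltas.len() as f64).ceil() as usize).max(1);
    Some((mean, deltas[rank - 1]))
}

fn main() {
    let positions = vec![
        Position { ours_token: 7, vllm_token: 7, ours_logprob: -0.10, vllm_logprob: -0.12 },
        // A flip: the engines picked different tokens, so it is left out of the stats.
        Position { ours_token: 7, vllm_token: 9, ours_logprob: -1.50, vllm_logprob: -0.90 },
        Position { ours_token: 3, vllm_token: 3, ours_logprob: -0.40, vllm_logprob: -0.36 },
    ];
    if let Some((mean, p99)) = delta_stats(&positions) {
        println!("mean |delta| = {mean:.4}, p99 |delta| = {p99:.4}");
    }
}
```

Filtering before aggregating keeps the structurally larger flip deltas out of the p99, which is the point the paragraph above makes about mixed populations.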
## Running it
@@ -62,16 +62,16 @@ GPU-argmax-vs-host-log-softmax disagreement on the same logits.
--out test_data/kimi-k2.6-vllm-golden.safetensors
# Run the gate (8 GPUs; vLLM must be stopped first — both need the full node):
-PEGAINFER_TEST_MODEL_PATH=/data/models/Kimi-K2.6 \
-cargo test -p pegainfer-kimi-k2 --features kimi-k2 --release \
+OPENINFER_TEST_MODEL_PATH=/data/models/Kimi-K2.6 \
+cargo test -p openinfer-kimi-k2 --features kimi-k2 --release \
--test vllm_golden_gate -- --nocapture
```
Build env on an H200/H20 node: `PATH` must include `/root/.cargo/bin` and
-`/usr/local/cuda/bin`, plus `PEGAINFER_CUDA_SM=90a` and
-`PEGAINFER_TRITON_PYTHON` (see `docs/models/kimi-k2/tp1-dp8-ep8-performance.md`).
+`/usr/local/cuda/bin`, plus `OPENINFER_CUDA_SM=90a` and
+`OPENINFER_TRITON_PYTHON` (see `docs/models/kimi-k2/tp1-dp8-ep8-performance.md`).
-There is no silent skip: missing `PEGAINFER_TEST_MODEL_PATH` or a missing
+There is no silent skip: missing `OPENINFER_TEST_MODEL_PATH` or a missing
fixture panics. (The qwen35 gate's env-gated skip silently reported
"ok 0.00s" — this gate deliberately does not.)
@@ -92,12 +92,12 @@ regret ≤ REGRET_BASE + REGRET_FLATNESS_SLOPE × (−vllm_top1_logprob)
= 0.30 + 0.35 × (−vllm_top1_logprob)
```
-where regret = how far pegainfer's pick sits below vLLM's argmax in vLLM's
+where regret = how far openinfer's pick sits below vLLM's argmax in vLLM's
own logprobs. At a confident position (top-1 ≈ 90%) the bound is ≈ 0.34 nat
— near-exact agreement; at a flat multi-modal position (top-1 ≈ 11%) it
reaches ≈ 1.07, because there is no single correct token for cross-engine
noise to deviate from. The bound depends only on the committed vLLM fixture,
-so pegainfer cannot influence its own tolerance.
+so openinfer cannot influence its own tolerance.
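
As a worked check of the two example bounds quoted above, the rule can be evaluated directly. This is a sketch under stated assumptions: the constant names and values come from the formula in this document, while `regret_bound` and `within_regret` are hypothetical helper names, and the sketch assumes vLLM's logprob for the picked token is available.

```rust
const REGRET_BASE: f64 = 0.30;
const REGRET_FLATNESS_SLOPE: f64 = 0.35;

/// Allowed regret (in nats) given vLLM's own top-1 logprob at a position.
fn regret_bound(vllm_top1_logprob: f64) -> f64 {
    REGRET_BASE + REGRET_FLATNESS_SLOPE * (-vllm_top1_logprob)
}

/// `pick_logprob` is vLLM's logprob of the token the other engine picked
/// (hypothetical helper; assumes that logprob is present in the fixture).
fn within_regret(vllm_top1_logprob: f64, pick_logprob: f64) -> bool {
    let regret = vllm_top1_logprob - pick_logprob;
    regret <= regret_bound(vllm_top1_logprob)
}

fn main() {
    // Confident position (top-1 = 90%) and flat position (top-1 = 11%).
    for p in [0.90_f64, 0.11] {
        println!("top-1 = {p:.2} -> bound = {:.2} nat", regret_bound(p.ln()));
    }
    // A pick 0.2 nat below a confident argmax passes; 0.5 nat below does not.
    let top1 = 0.90_f64.ln();
    println!("{} {}", within_regret(top1, top1 - 0.2), within_regret(top1, top1 - 0.5));
}
```

Both inputs to the bound are vLLM-side quantities, so the tolerance is fixed by the committed fixture alone.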
Calibration (three 8×H200 runs, 2026-06-05/06, vLLM 0.22.0 fixture):
~98% of positions match exactly in every pass; every cross-engine
diff --git a/docs/models/kimi-k2/bringup-history.md b/docs/models/kimi-k2/bringup-history.md
index ee28d80b..598d5e7e 100644
--- a/docs/models/kimi-k2/bringup-history.md
+++ b/docs/models/kimi-k2/bringup-history.md
@@ -38,11 +38,11 @@ K2.5 与 K2.6 文本架构相同,K2.6 是继续训练版;shape/TP8/EP8 规
- Marlin weight repack (`gptq_marlin_moe_repack`, no-actorder): checkpoint offset-binary `[expert,out,K/8] int32` → Marlin `uint4b8` bias=8 `[expert,K/16,N*2] int32`. Total bytes are unchanged and **no `xor 0x88` is applied** (the unsigned nibbles are preserved).
- Marlin scale permute (`marlin_moe_permute_scales`): checkpoint `[expert,out,in_group]` → `[expert,in_group,out]` in group-major order with the 64-block `scale_perm`.
- W13 must be fused into `gate_then_up` at load/package time (the vLLM runtime ABI does not accept separate gate/up): fused W13 int32 view `[48,448,8192]`, scale `[48,224,4096]`; W2 packed `[48,128,14336]`, scale `[48,64,7168]`. The resident package is fused W13 + W2; gate/up are only load-time temporary buffers.
-- Key fix (Marlin atomic split-K): vLLM `fused_marlin_moe` uses `use_atomic_add=False, use_fp32_reduce=True` for both W13 and W2, accumulating through a global F32 `c_tmp`. PegaInfer originally hard-coded `use_atomic_add=true` and passed no `c_tmp`, so with split-K>1 it wrote C through BF16 `atomicAdd` with a nondeterministic accumulation order → row-state divergence. The fix preallocates an F32 `c_tmp` and disables atomics. On H20, single-layer W13/route_output/final match the vLLM reference with `max_diff=0 / mean_diff=0` (real K2.5 rank0 layer1).
+- Key fix (Marlin atomic split-K): vLLM `fused_marlin_moe` uses `use_atomic_add=False, use_fp32_reduce=True` for both W13 and W2, accumulating through a global F32 `c_tmp`. OpenInfer originally hard-coded `use_atomic_add=true` and passed no `c_tmp`, so with split-K>1 it wrote C through BF16 `atomicAdd` with a nondeterministic accumulation order → row-state divergence. The fix preallocates an F32 `c_tmp` and disables atomics. On H20, single-layer W13/route_output/final match the vLLM reference with `max_diff=0 / mean_diff=0` (real K2.5 rank0 layer1).
### Router scale placement
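-vLLM `grouped_topk` returns normalized topk weights that are **not yet multiplied** by `routed_scaling_factor`; `DeepseekV2MoE.forward` multiplies the summed routed-expert output by `2.827` as a whole. PegaInfer originally multiplied `2.827` into the router topk weights before feeding W2, so its rounding boundaries differed from vLLM's → the router now outputs unscaled weights, and the scale is applied once after the routed F32 sum/reduce.
+vLLM `grouped_topk` returns normalized topk weights that are **not yet multiplied** by `routed_scaling_factor`; `DeepseekV2MoE.forward` multiplies the summed routed-expert output by `2.827` as a whole. OpenInfer originally multiplied `2.827` into the router topk weights before feeding W2, so its rounding boundaries differed from vLLM's → the router now outputs unscaled weights, and the scale is applied once after the routed F32 sum/reduce.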
### Tokenizer / prompt contract
@@ -52,11 +52,11 @@ vLLM `grouped_topk` 返回 **未乘** `routed_scaling_factor` 的 normalized top
## Removed / superseded (tombstones)
-- **Expert-major INT4 / CUTLASS example69 path — removed in #234.** Bring-up first built a CUTLASS example69 (Hopper INT4×BF16 grouped GEMM) probe as the routed-expert backend. A focused H20 probe proved it could not express Kimi's `group_size=32` per-K-group scale: example69 reloads scale on a 64-wide K tile (`TileShapeK=64`), so col `32/33` reused group0 scale and col `64` used group1 scale; `TileShapeK=32` hits CUTLASS static assert `K_BLOCK_MAX >= 4`. The path was demoted to a limitation probe and then deleted in #234 — the CUTLASS-era projection kernels/probe (`weight_packed_cutlass_example69`, `weight_shape` tensor loading, the example69 launcher and FFI) are gone. Marlin WNA16 is the only runtime INT4 path. `KimiExpertMajorProjectionPlan` (`pegainfer-kimi-k2/src/weights/package.rs`) remains **live** — it describes the EP weight layout, not the dead CUTLASS kernel. `KimiExpertMajorRoute` outlived its callers (DeepEP routing replaced it) and was deleted in the post-#298 dead-code sweep.
-- **`weight_shape` GPU load — removed in #234.** It was loaded for 60 MoE layers × 384 experts × 3 projections, validated to `[2]`, then never consumed by any kernel (dims come from manifest constants). Dropping it removes **8,640 tensors** from the load set (`pegainfer-kimi-k2/src/weights/tests.rs` asserts the count). The checkpoint still carries `weight_shape` on disk; the runtime simply no longer reads it.
-- **`KIMI_RUNNER_MAX_BATCH = 4` hard-cap — superseded.** Bring-up locked decode at a fixed bs4 wave. The const is now `64` (`pegainfer-kimi-k2/src/runner/scheduler.rs`), with worker decode arenas bucketed `[1, 2, 4, 8, 16, 32, 64]` (`KIMI_DECODE_BATCH_BUCKETS`, `worker.rs`) and per-request cap `KIMI_MAX_REQUEST_TOKENS = 8192` (DP prompt cap `PPLX_MAX_DISPATCH_TOKENS = 2048`). Changing the cap is not a one-const edit: it ties arena count, every `decode_batch_size`-shaped scratch/router/Marlin shape, and per-bucket CUDA-graph capture.
-- **`kimi-k2-pplx-ep` cargo feature + `PEGAINFER_KIMI_PARALLEL` env — removed.** Parallel shape and EP backend are now CLI flags: `--tp-size/--dp-size/--ep-backend`. The feature is just `kimi-k2`. Active line: `--tp-size 1 --dp-size 8 --ep-backend pplx`.
-- **Internal H20 smoke/candidate/debug test entries — removed.** Direct worker/scheduler no longer carries `forward_prompt_smoke`, `ForwardOneTokenSmoke`, full-decode smoke, row-diff D2H instrumentation, or candidate-dump tests; only CPU unit tests (placement, page metadata) remain. Progress is gated end-to-end through `pegainfer-server` / `bench_serving` / OpenAI `/v1/completions`.
+- **Expert-major INT4 / CUTLASS example69 path — removed in #234.** Bring-up first built a CUTLASS example69 (Hopper INT4×BF16 grouped GEMM) probe as the routed-expert backend. A focused H20 probe proved it could not express Kimi's `group_size=32` per-K-group scale: example69 reloads scale on a 64-wide K tile (`TileShapeK=64`), so col `32/33` reused group0 scale and col `64` used group1 scale; `TileShapeK=32` hits CUTLASS static assert `K_BLOCK_MAX >= 4`. The path was demoted to a limitation probe and then deleted in #234 — the CUTLASS-era projection kernels/probe (`weight_packed_cutlass_example69`, `weight_shape` tensor loading, the example69 launcher and FFI) are gone. Marlin WNA16 is the only runtime INT4 path. `KimiExpertMajorProjectionPlan` (`openinfer-kimi-k2/src/weights/package.rs`) remains **live** — it describes the EP weight layout, not the dead CUTLASS kernel. `KimiExpertMajorRoute` outlived its callers (DeepEP routing replaced it) and was deleted in the post-#298 dead-code sweep.
+- **`weight_shape` GPU load — removed in #234.** It was loaded for 60 MoE layers × 384 experts × 3 projections, validated to `[2]`, then never consumed by any kernel (dims come from manifest constants). Dropping it removes **8,640 tensors** from the load set (`openinfer-kimi-k2/src/weights/tests.rs` asserts the count). The checkpoint still carries `weight_shape` on disk; the runtime simply no longer reads it.
+- **`KIMI_RUNNER_MAX_BATCH = 4` hard-cap — superseded.** Bring-up locked decode at a fixed bs4 wave. The const is now `64` (`openinfer-kimi-k2/src/runner/scheduler.rs`), with worker decode arenas bucketed `[1, 2, 4, 8, 16, 32, 64]` (`KIMI_DECODE_BATCH_BUCKETS`, `worker.rs`) and per-request cap `KIMI_MAX_REQUEST_TOKENS = 8192` (DP prompt cap `PPLX_MAX_DISPATCH_TOKENS = 2048`). Changing the cap is not a one-const edit: it ties arena count, every `decode_batch_size`-shaped scratch/router/Marlin shape, and per-bucket CUDA-graph capture.
+- **`kimi-k2-pplx-ep` cargo feature + `OPENINFER_KIMI_PARALLEL` env — removed.** Parallel shape and EP backend are now CLI flags: `--tp-size/--dp-size/--ep-backend`. The feature is just `kimi-k2`. Active line: `--tp-size 1 --dp-size 8 --ep-backend pplx`.
+- **Internal H20 smoke/candidate/debug test entries — removed.** Direct worker/scheduler no longer carries `forward_prompt_smoke`, `ForwardOneTokenSmoke`, full-decode smoke, row-diff D2H instrumentation, or candidate-dump tests; only CPU unit tests (placement, page metadata) remain. Progress is gated end-to-end through `openinfer-server` / `bench_serving` / OpenAI `/v1/completions`.
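A minimal sketch of the bucketed decode-arena lookup described in the `KIMI_RUNNER_MAX_BATCH` entry above, assuming the worker picks the smallest bucket that fits the active row count. `pick_decode_bucket` is a hypothetical helper, not a function from `worker.rs`; only the bucket list and the cap of 64 come from the text.

```rust
/// Decode batch buckets quoted in the text (`KIMI_DECODE_BATCH_BUCKETS`).
const DECODE_BATCH_BUCKETS: [usize; 7] = [1, 2, 4, 8, 16, 32, 64];

/// Hypothetical helper: smallest bucket that can hold `active_rows`.
/// Returns `None` for an empty batch or one above the largest bucket.
fn pick_decode_bucket(active_rows: usize) -> Option<usize> {
    if active_rows == 0 {
        return None;
    }
    DECODE_BATCH_BUCKETS
        .iter()
        .copied()
        .find(|&bucket| bucket >= active_rows)
}
```

Under this assumption, every distinct bucket needs its own arena and its own graph capture, which is one way to read why changing the cap is not a one-const edit.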
## Chronology (decision records)
@@ -82,10 +82,10 @@ The bring-up ran ~2026-05-20 to 2026-05-22 on an 8×H200 node against a vLLM `0.
## Reference tooling (off-repo fixtures)
-- `pegainfer-kernels/tools/kimi_k2/hf_logits_reference.py` — HF raw full-logits reference (trust_remote_code + vision-tower stub; INT4-only reference, slow run_compressed load).
-- `pegainfer-kernels/tools/kimi_k2/vllm_logits_reference.py` — vLLM serving top-logprob fixture (vLLM 0.19.0 caps sample `logprobs` at 20, so the bring-up gate used top-20). Supports `--prompt-set-json` for batched multi-prompt cases.
-- `pegainfer-kernels/tools/kimi_k2/vllm_marlin_wna16_reference.py` — vLLM Marlin W13 / W2 / final BF16 reference; `--model-path ... --layer-idx 1 --rank 0` reads the real checkpoint's rank-local experts.
-- `pegainfer-kernels/tools/kimi_k2/compare_vllm_topk_fixture.py` / `compare_logits_fixture.py` — candidate comparison (argmax / top-k overlap / logits diff).
-- `pegainfer-kernels/tools/kimi_k2/torch_reference.py` — compressed-tensors official pack/dequant, bit-exact INT4 single-expert fixture (self-check `0-diff`).
+- `openinfer-kernels/tools/kimi_k2/hf_logits_reference.py` — HF raw full-logits reference (trust_remote_code + vision-tower stub; INT4-only reference, slow run_compressed load).
+- `openinfer-kernels/tools/kimi_k2/vllm_logits_reference.py` — vLLM serving top-logprob fixture (vLLM 0.19.0 caps sample `logprobs` at 20, so the bring-up gate used top-20). Supports `--prompt-set-json` for batched multi-prompt cases.
+- `openinfer-kernels/tools/kimi_k2/vllm_marlin_wna16_reference.py` — vLLM Marlin W13 / W2 / final BF16 reference; `--model-path ... --layer-idx 1 --rank 0` reads the real checkpoint's rank-local experts.
+- `openinfer-kernels/tools/kimi_k2/compare_vllm_topk_fixture.py` / `compare_logits_fixture.py` — candidate comparison (argmax / top-k overlap / logits diff).
+- `openinfer-kernels/tools/kimi_k2/torch_reference.py` — compressed-tensors official pack/dequant, bit-exact INT4 single-expert fixture (self-check `0-diff`).
A strict bit-level `h20_kimi_marlin_wna16_single_layer_matches_vllm_reference` gate is kept `#[ignore]` (red by design): vLLM Marlin's W2 atomic split-K accumulation order gives `route_output max_diff=96 / mean_diff=1.86` at BF16 magnitude ~7000 (< 0.03% relative, ~1.5 ULP) — not an algorithm bug. Turning it green requires either a `use_fp32_reduce` fixture or a ULP-relative tolerance.
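One possible shape for the ULP-relative tolerance mentioned above, as a hedged sketch. `bf16_ulp` and `within_bf16_ulps` are hypothetical names, not helpers from the repository, and the comparison rule (scale by the larger magnitude) is an assumption.

```rust
/// BF16 keeps 8 significant bits (1 implicit + 7 stored), so one ULP at
/// magnitude 2^e is 2^(e - 7). Returns that spacing for a finite `x`.
fn bf16_ulp(x: f32) -> f32 {
    let biased = ((x.abs().to_bits() >> 23) & 0xFF) as i32;
    // Treat zero and subnormals like the smallest normal exponent.
    let exponent = (biased - 127).max(-126);
    // Clamp so the result stays a normal f32 power of two.
    let ulp_exponent = (exponent - 7).max(-126);
    f32::from_bits(((ulp_exponent + 127) as u32) << 23)
}

/// Hypothetical gate: accept `got` if it is within `max_ulps` BF16 ULPs of
/// `want`, measured at the larger of the two magnitudes.
fn within_bf16_ulps(got: f32, want: f32, max_ulps: f32) -> bool {
    let scale = got.abs().max(want.abs());
    (got - want).abs() <= max_ulps * bf16_ulp(scale)
}
```

A gate built this way scales the allowed difference with the output magnitude, so large activations are not held to an absolute threshold tuned for small ones.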
diff --git a/docs/models/kimi-k2/deepep-migration.md b/docs/models/kimi-k2/deepep-migration.md
index da4fe9e6..e75b23d3 100644
--- a/docs/models/kimi-k2/deepep-migration.md
+++ b/docs/models/kimi-k2/deepep-migration.md
@@ -1,13 +1,13 @@
# Kimi-K2 MoE EP: PPLX → DeepEP migration
-> **TL;DR:** Implemented and 8×H200-verified — Kimi-K2's MoE EP backend is now DeepEP (elastic API, AOT-instantiated, no torch/NVRTC/NVSHMEM); PPLX is fully deleted from the kimi path (`moe_pplx.rs` gone, kimi crate no longer depends on `pegainfer-comm`). Decode = `do_expand=true` + `do_cpu_sync=false`: fixed worst-case buffers, zero host syncs/allocs per step → **CUDA graph capture enabled (#227): bs64 steady TPOT p50 26.03 ms vs 29.61 eager (−12%), replay only at full bucket occupancy**. Prefill = `do_cpu_sync=true` with host spin on pinned counters. Marlin consumes the DeepEP recv buffer **in place** (expert_alignment 8 == Marlin block size; identity routing + sentinels). Same-node A/B vs PPLX on hth200-29: **eager bs64 TPOT p50 29.61 vs 29.79 ms (parity), comm itself 7µs/layer faster**; golden gate equivalent to main (free-greedy near-tie reds on both backends, teacher-forced 0 violations both). The initial port was +14% TPOT slower until two capacity-proportional adapter kernels were fixed (see the lesson below).
+> **TL;DR:** Implemented and 8×H200-verified — Kimi-K2's MoE EP backend is now DeepEP (elastic API, AOT-instantiated, no torch/NVRTC/NVSHMEM); PPLX is fully deleted from the kimi path (`moe_pplx.rs` gone, kimi crate no longer depends on `openinfer-comm`). Decode = `do_expand=true` + `do_cpu_sync=false`: fixed worst-case buffers, zero host syncs/allocs per step → **CUDA graph capture enabled (#227): bs64 steady TPOT p50 26.03 ms vs 29.61 eager (−12%), replay only at full bucket occupancy**. Prefill = `do_cpu_sync=true` with host spin on pinned counters. Marlin consumes the DeepEP recv buffer **in place** (expert_alignment 8 == Marlin block size; identity routing + sentinels). Same-node A/B vs PPLX on hth200-29: **eager bs64 TPOT p50 29.61 vs 29.79 ms (parity), comm itself 7µs/layer faster**; golden gate equivalent to main (free-greedy near-tie reds on both backends, teacher-forced 0 violations both). The initial port was +14% TPOT slower until two capacity-proportional adapter kernels were fixed (see the lesson below).
>
> **Last touched:** 2026-06
## Architecture as built
```
-pegainfer-kernels/
+openinfer-kernels/
third_party/DeepEP # submodule d4f41e4 (2026-05-26)
csrc/deepep/deepep_shim.cu # AOT template instantiation (Kimi config baked:
# 384 experts / 48 local, topk 8, hidden 7168, 8 ranks)
@@ -15,20 +15,20 @@ pegainfer-kernels/
src/ffi/deepep.rs # repr(C) DeepEpInfo + extern decls
src/ops/deepep.rs # DeepEp wrapper: decode_dispatch/decode_combine (no sync),
# prefill_dispatch_send/wait_counts/recv + prefill_combine
-pegainfer-kimi-k2/
+openinfer-kimi-k2/
src/runner/moe_deepep.rs # the MoE layer:
# forward_moe_layer_decode_deepep_normed (host-quiet)
# forward_moe_layer_prefill_deepep (cpu-sync)
```
-Build needs `PEGAINFER_NCCL_ROOT` pointing at NCCL ≥ 2.30 (device API: `ncclDevComm`,
+Build needs `OPENINFER_NCCL_ROOT` pointing at NCCL ≥ 2.30 (device API: `ncclDevComm`,
windows, GIN). The binary links `libnccl.so.2` via `LD_LIBRARY_PATH` at runtime.
-Local dev: `PEGAINFER_NCCL_ROOT=/data/opt/nccl-2.30.4`.
+Local dev: `OPENINFER_NCCL_ROOT=/data/opt/nccl-2.30.4`.
Backend selection: TP1/DP8 **requires** `--ep-backend=deepep` (default), TP8/DP1
requires `nccl` — both enforced with `ensure!` in `runner/bringup.rs`. There is no
-PPLX fallback by design ("we are not very fond of pplx ep"). `pegainfer-comm`/PPLX survive
-only for the deepseek crates, which use their own `pegainfer_comm::EpBackend` type.
+PPLX fallback by design ("we are not very fond of pplx ep"). `openinfer-comm`/PPLX survive
+only for the deepseek crates, which use their own `openinfer_comm::EpBackend` type.
## The contracts the integration stands on (verified in upstream source)
@@ -161,7 +161,7 @@ Node env facts (also apply to other hth200 nodes until proven otherwise):
plugin loads but deeper init fails without DOCA GPUNetIO; intranode traffic
is NVLink windows, GIN is inter-node-only.
- System NCCL is exactly 2.30.4 (`/usr/include` + `/usr/lib/x86_64-linux-gnu`);
- `PEGAINFER_NCCL_ROOT` wants the `include/`+`lib/` layout — a symlink tree at
+ `OPENINFER_NCCL_ROOT` wants the `include/`+`lib/` layout — a symlink tree at
`/data/opt/nccl-2.30.4` bridges it.
- The bastion swallows ssh exit codes — poll remote jobs with output markers,
never `$?`. `pkill -f ` self-matches the ssh wrapper command line —
diff --git a/docs/models/kimi-k2/dp-design.md b/docs/models/kimi-k2/dp-design.md
index 52e47d40..91331ec6 100644
--- a/docs/models/kimi-k2/dp-design.md
+++ b/docs/models/kimi-k2/dp-design.md
@@ -31,7 +31,7 @@ local_experts = 384 / ep_world (split by ep_world)
```rust
/// Pure parallel topology, independent of the model. Reusable for DSV4, Qwen, etc.
-/// Lives in pegainfer-core.
+/// Lives in openinfer-core.
pub struct ParallelConfig {
pub tp_world: usize,
pub dp_world: usize,
@@ -39,7 +39,7 @@ pub struct ParallelConfig {
}
/// The coordinates of one rank in the TP×DP×EP grid.
-/// Lives in pegainfer-core.
+/// Lives in openinfer-core.
pub struct RankCoord {
pub global_rank: usize,
pub tp_rank: usize, // global_rank % tp_world
@@ -48,7 +48,7 @@ pub struct RankCoord {
}
/// Kimi-K2 specific: model dimensions derived from the topology.
-/// Continuation of the existing KimiK2ParallelShape; stays in pegainfer-kimi-k2.
+/// Continuation of the existing KimiK2ParallelShape; stays in openinfer-kimi-k2.
pub struct KimiK2ModelConfig {
pub topo: ParallelConfig,
pub heads_per_tp: usize, // = 64 / tp_world
diff --git a/docs/models/kimi-k2/kv-cache-design.md b/docs/models/kimi-k2/kv-cache-design.md
index cfdb4b0c..803e9e03 100644
--- a/docs/models/kimi-k2/kv-cache-design.md
+++ b/docs/models/kimi-k2/kv-cache-design.md
@@ -20,7 +20,7 @@ after burning up to 2047 tokens of compute. Nothing validated
### 2. Kimi KV is already paged — the kernels need nothing
-`KimiMlaPagedKvLayout` (`pegainfer-kernels/src/ops/kimi_k2/mla.rs`), page
+`KimiMlaPagedKvLayout` (`openinfer-kernels/src/ops/kimi_k2/mla.rs`), page
table buffers (`page_indices_d/page_indptr_d/last_page_len_d`), a paged append
kernel (`kimi_mla_paged_kv_append`) and a paged MLA decode kernel
(`kimi_flashinfer_batch_decode_mla`) were all in production. The "fixed arena"
@@ -39,12 +39,12 @@ threads explicit `positions_d`, so the fix shape is cleaner than qwen3's was.
## Design (as landed)
-### Logical/physical split: `BlockPool` in `pegainfer-kv-cache`
+### Logical/physical split: `BlockPool` in `openinfer-kv-cache`
The qwen3 `KvCacheManager` was split so MLA models reuse the logical layer
without inheriting the full-attention physical layout:
-- **`BlockPool`** (`pegainfer-kv-cache/src/pool.rs`): kvbm `BlockManager` +
+- **`BlockPool`** (`openinfer-kv-cache/src/pool.rs`): kvbm `BlockManager` +
the reserved padding block + `RequestKv` (the `SchedulableSequence` wrapper:
`schedule_prefill/apply_prefill/schedule_decode/apply_decode`, RAII release).
Owns **no GPU memory** — it hands out block IDs.
@@ -84,7 +84,7 @@ it within 16 decode steps). Handing that raw list to the worker trips the
exact-match check above. Every page row given to a forward pass must come
from `RequestKv::step_page_indices(new_tokens)`, which trims to
`ceil((kv_position + new_tokens)/16)`; a regression test in
-`pegainfer-kv-cache/src/pool.rs` sweeps prompt lengths × decode steps and
+`openinfer-kv-cache/src/pool.rs` sweeps prompt lengths × decode steps and
self-retires if kvbm stops over-allocating.
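The trimming rule above can be sketched as follows. This is an illustration under stated assumptions: `trim_page_row` is a hypothetical stand-in, not the real `RequestKv::step_page_indices`, and the page size of 16 is taken from the `/16` in the formula.

```rust
/// Page size implied by the `/16` in the text above.
const PAGE_TOKENS: usize = 16;

/// Pages a forward pass needs: ceil((kv_position + new_tokens) / 16).
fn pages_for_step(kv_position: usize, new_tokens: usize) -> usize {
    (kv_position + new_tokens).div_ceil(PAGE_TOKENS)
}

/// Hypothetical stand-in for the trimming step: keep only the pages this
/// step touches and drop any blocks the allocator handed out early.
fn trim_page_row(allocated: &[u32], kv_position: usize, new_tokens: usize) -> &[u32] {
    let needed = pages_for_step(kv_position, new_tokens);
    assert!(needed <= allocated.len(), "allocator returned too few pages");
    &allocated[..needed]
}
```

For example, a request at `kv_position = 16` decoding one token needs two pages, so a third pre-allocated block is cut from the row before it reaches the worker.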
Why it surfaced as a *hang*, not an error: on DP the owning rank's
diff --git a/docs/models/kimi-k2/optimization.md b/docs/models/kimi-k2/optimization.md
index bdbe58ea..13b1d2c6 100644
--- a/docs/models/kimi-k2/optimization.md
+++ b/docs/models/kimi-k2/optimization.md
@@ -6,7 +6,7 @@
## Goal
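-PegaInfer Kimi-K2 end-to-end latency and throughput should match or exceed the vLLM 0.19.0 baseline on the same 8× H20 configuration, with greedy token-id parity kept as the hard keep/revert gate. **The current focus is decode performance**; prefill is worked on in parallel with the decode mainline but is not the priority.
+OpenInfer Kimi-K2 end-to-end latency and throughput should match or exceed the vLLM 0.19.0 baseline on the same 8× H20 configuration, with greedy token-id parity kept as the hard keep/revert gate. **The current focus is decode performance**; prefill is worked on in parallel with the decode mainline but is not the priority.
Phased roadmap (the first two steps have landed; TP1+DP8+EP8 is the current active line):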
@@ -29,13 +29,13 @@ PegaInfer Kimi-K2 end-to-end latency and throughput should match or exceed
## E2E Dashboard (TP8+EP8 historical bring-up measurements)
-> This section is the historical dashboard for the TP8+EP8 NCCL graph path, with concurrency locked at bs4. It records the keep/revert gate from the bring-up phase, not the current serving cap. **The current active line (TP1+DP8+EP8) has a decode batch cap of 64**, bucketed `[1,2,4,8,16,32,64]` (`KIMI_RUNNER_MAX_BATCH = 64`, `pegainfer-kimi-k2/src/runner/scheduler.rs`); for bs64 service data see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md) / [roadmap.md](roadmap.md).
+> This section is the historical dashboard for the TP8+EP8 NCCL graph path, with concurrency locked at bs4. It records the keep/revert gate from the bring-up phase, not the current serving cap. **The current active line (TP1+DP8+EP8) has a decode batch cap of 64**, bucketed `[1,2,4,8,16,32,64]` (`KIMI_RUNNER_MAX_BATCH = 64`, `openinfer-kimi-k2/src/runner/scheduler.rs`); for bs64 service data see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md) / [roadmap.md](roadmap.md).
-GPU: 8× NVIDIA H20. Model: Kimi-K2.5 (Kimi-K2.6 weights share the same architecture; K2.5 was the H20 validation path at the time). vLLM: 0.19.0. **vLLM runs as TP1+DP8+EP8**, unlike pegainfer's TP8+EP8 shape at the time. This is not an apples-to-apples comparison; it is a baseline comparison of two different sharding routes on the same hardware (see [vllm-h20-baseline.md](vllm-h20-baseline.md)).
+GPU: 8× NVIDIA H20. Model: Kimi-K2.5 (Kimi-K2.6 weights share the same architecture; K2.5 was the H20 validation path at the time). vLLM: 0.19.0. **vLLM runs as TP1+DP8+EP8**, unlike openinfer's TP8+EP8 shape at the time. This is not an apples-to-apples comparison; it is a baseline comparison of two different sharding routes on the same hardware (see [vllm-h20-baseline.md](vllm-h20-baseline.md)).
-In-process bench (pegainfer's built-in `bench_serving request`):
+In-process bench (openinfer's built-in `bench_serving request`):
-| Profile | Metric | pegainfer | Notes |
+| Profile | Metric | openinfer | Notes |
| --- | --- | --- | --- |
| short-prompt streaming (~30 tok in, free out) | TTFT | `1995.5ms` | HTTP `/v1/completions` end-to-end |
| short-prompt streaming (~30 tok in, free out) | TPOT | `14.48ms` (≈30.8 tok/s) | HTTP |
@@ -46,11 +46,11 @@ In-process bench (pegainfer's built-in `bench_serving request`):
HTTP bench with the same client (`vllm bench serve`), decode-heavy profile (input=1, output=128, ignore-eos, bs=4):
-| Metric | pegainfer TP8+EP8 | vLLM TP1+DP8+EP8 | delta |
+| Metric | openinfer TP8+EP8 | vLLM TP1+DP8+EP8 | delta |
| --- | ---: | ---: | --- |
-| TPOT median | `19.13ms` | `24.97ms` | pegainfer −23% |
-| TPOT p99 | `23.63ms` | `29.46ms` | pegainfer −20% |
-| ITL median | `17.42ms` | `23.02ms` | pegainfer −24% |
+| TPOT median | `19.13ms` | `24.97ms` | openinfer −23% |
+| TPOT p99 | `23.63ms` | `29.46ms` | openinfer −20% |
+| ITL median | `17.42ms` | `23.02ms` | openinfer −24% |
| TTFT median | `313.10ms` | `69.60ms` | **vLLM 4.5× lower** |
| TTFT p99 | `4239.97ms` | `135.40ms` | **vLLM 31× lower** |
| Output tok/s | `159.99` | `157.94` | comparable |
@@ -59,7 +59,7 @@ HTTP bench with the same client (`vllm bench serve`), decode-heavy profile (input=1,
- The in-process bench comes from `target/release/bench_serving request --cuda-graph true ...`; it has passed the four-concurrency vLLM fixture greedy gate and is not swamped by prompt prefill.
- Short-prompt streaming TTFT is the OpenAI-compatible `/v1/completions` end-to-end window (including first-collective stream drain, scheduler, and frontend), not pure prefill GPU time; the prefill phase breakdown has not started yet (see the Open section).
-- The HTTP bench drives the pegainfer and vLLM servers separately with the same `vllm bench serve --backend openai --endpoint /v1/completions` invocation, so the client and metric definitions are identical. For the full vLLM TP1+DP8+EP8 bs 1..256 sweep, see [vllm-h20-baseline.md](vllm-h20-baseline.md).
+- The HTTP bench drives the openinfer and vLLM servers separately with the same `vllm bench serve --backend openai --endpoint /v1/completions` invocation, so the client and metric definitions are identical. For the full vLLM TP1+DP8+EP8 bs 1..256 sweep, see [vllm-h20-baseline.md](vllm-h20-baseline.md).
- **HTTP 19.13 vs in-process 14.39 is a 4.74ms / token gap, ~33% overhead**: frontend / streaming should not cost this much; it is recorded in the Open section as a separate investigation item.
## Architecture
@@ -273,7 +273,7 @@ The Marlin numbers assume a synthetic all-local route, not the real EP8 global rout
**Bottleneck:** With the fixed 4-concurrency H20 fixture at `max_tokens=16`, row1 intermittently output `[1008,2742,924,6454,...]` (expected `[1008,2742,2531,414,...]`). Per-phase row first-diff narrowed the divergence point to the layer1 routed expert path, earliest at `moe_w13_out`.
-**Root cause:** The PegaInfer Marlin WNA16 wrapper hard-coded `use_atomic_add=true` and did not pass `c_tmp`. When split-K > 1, the kernel accumulates directly into output C with BF16 `atomicAdd`; BF16 atomics on H20 are sensitive to accumulation order, and the nondeterministic ordering across ranks/tokens corrupted the row state. vLLM's own `fused_marlin_moe.py` passes `use_atomic_add=False, use_fp32_reduce=True` for both W13 and W2, accumulating through a global F32 `c_tmp`.
+**Root cause:** The OpenInfer Marlin WNA16 wrapper hard-coded `use_atomic_add=true` and did not pass `c_tmp`. When split-K > 1, the kernel accumulates directly into output C with BF16 `atomicAdd`; BF16 atomics on H20 are sensitive to accumulation order, and the nondeterministic ordering across ranks/tokens corrupted the row state. vLLM's own `fused_marlin_moe.py` passes `use_atomic_add=False, use_fp32_reduce=True` for both W13 and W2, accumulating through a global F32 `c_tmp`.
**Approach:** The worker / decode arena preallocates a `c_tmp` F32 buffer, the Marlin launch switches to vLLM's global-reduce path (`use_atomic_add=false`), and output / locks are zero-filled at step boundaries.
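The order sensitivity behind this root cause can be reproduced on the CPU. The sketch below is an illustration only, not the Marlin kernel: it assumes each BF16 atomic add behaves like "add, then round the running sum to BF16", and all three function names are hypothetical.

```rust
/// Round an f32 to the nearest BF16 value (ties to even), returned as f32.
/// Only meant for finite inputs; NaN handling is omitted in this sketch.
fn round_to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    f32::from_bits(rounded & 0xFFFF_0000)
}

/// Models BF16 atomicAdd: every partial sum is rounded back to BF16.
fn accumulate_bf16(parts: &[f32]) -> f32 {
    parts.iter().fold(0.0, |acc, &p| round_to_bf16(acc + p))
}

/// Models an F32 reduce buffer: accumulate in F32, round to BF16 once.
fn accumulate_f32_then_round(parts: &[f32]) -> f32 {
    round_to_bf16(parts.iter().sum::<f32>())
}
```

With the split-K partials `256, 1, 1`, the BF16 path yields 256 in one arrival order and 258 in another, while the F32 path yields 258 for both. F32 accumulation is not order-independent in general either; it only pushes the rounding error well below BF16 resolution for sums like these.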
diff --git a/docs/models/kimi-k2/pplx-ep-correctness.md b/docs/models/kimi-k2/pplx-ep-correctness.md
index 97cba5f9..07c718c7 100644
--- a/docs/models/kimi-k2/pplx-ep-correctness.md
+++ b/docs/models/kimi-k2/pplx-ep-correctness.md
@@ -24,7 +24,7 @@ Target comparison:
> CLI note: the parallel shape and EP backend are selected by the
> `--tp-size/--dp-size/--ep-backend` flags. The old `kimi-k2-pplx-ep` cargo
-> feature and `PEGAINFER_KIMI_PARALLEL` env (used in the original 2026-05-25 run)
+> feature and `OPENINFER_KIMI_PARALLEL` env (used in the original 2026-05-25 run)
> have been removed; the feature is now just `kimi-k2`.
TP1/DP8 PPLX is intentionally not the baseline for this document. The current
@@ -34,9 +34,9 @@ repair first makes TP8/DP1 PPLX match TP8/DP1 NCCL.
| Date | Path | Output | Result |
| --- | --- | --- | --- |
-| 2026-05-25 | `cargo check --release -p pegainfer-server --features kimi-k2 --bin bench_serving` (PPLX selected via `--ep-backend pplx`) | clean build on 8×H200 node | Pass |
-| 2026-05-25 | `cargo check --release -p pegainfer-server --features kimi-k2 --bin bench_serving` | clean build on 8×H200 node | Pass |
-| 2026-05-25 | `cargo test --release -p pegainfer-comm --test pplx_roundtrip -- --nocapture` | 8 ranks dispatch+combine roundtrip, each rank received 512 tokens | Pass |
+| 2026-05-25 | `cargo check --release -p openinfer-server --features kimi-k2 --bin bench_serving` (PPLX selected via `--ep-backend pplx`) | clean build on 8×H200 node | Pass |
+| 2026-05-25 | `cargo check --release -p openinfer-server --features kimi-k2 --bin bench_serving` | clean build on 8×H200 node | Pass |
+| 2026-05-25 | `cargo test --release -p openinfer-comm --test pplx_roundtrip -- --nocapture` | 8 ranks dispatch+combine roundtrip, each rank received 512 tokens | Pass |
| 2026-05-25 | TP8 PPLX bs4, output 5, iters 3 | `$RESULT_ROOT/kimi_pplx_tp8_bs4_o5_final.json`: 12/12 traces hash `7c4c5d83355198fd` | Pass |
| 2026-05-25 | TP8 NCCL bs64 active decode | `$RESULT_ROOT/kimi_nccl_tp8_active64_o5_final.json`: `Counter({'7c4c5d83355198fd': 32, '9eecc1ca6fb3409d': 32})`, steady TPOT p50 `97.53ms` | Reference |
| 2026-05-25 | TP8 PPLX bs64 active decode | `$RESULT_ROOT/kimi_pplx_tp8_active64_o5_after_review.json`: `Counter({'7c4c5d83355198fd': 32, '9eecc1ca6fb3409d': 32})`, steady TPOT p50 `110.14ms` | Matches NCCL |
@@ -51,18 +51,18 @@ per-index trace equality between PPLX and NCCL for the same active scheduling.
Common environment:
```bash
-cd $PEGAINFER_DIR
+cd $OPENINFER_DIR
export CUDA_HOME=/usr/local/cuda
export NVCC=/usr/local/cuda/bin/nvcc
-export LD_LIBRARY_PATH=$RESULT_ROOT/pegainfer-nccl-lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
-export PEGAINFER_CUDA_SM=90a
-export PEGAINFER_TRITON_PYTHON=$PEGAINFER_DIR/.triton-venv/bin/python
+export LD_LIBRARY_PATH=$RESULT_ROOT/openinfer-nccl-lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
+export OPENINFER_CUDA_SM=90a
+export OPENINFER_TRITON_PYTHON=$OPENINFER_DIR/.triton-venv/bin/python
```
NCCL reference (TP8/DP1):
```bash
-cargo run --quiet --release -p pegainfer-server --features kimi-k2 --bin bench_serving -- \
+cargo run --quiet --release -p openinfer-server --features kimi-k2 --bin bench_serving -- \
--model-path $MODEL_DIR \
--tp-size 8 --dp-size 1 --ep-backend nccl \
--cuda-graph false \
@@ -74,7 +74,7 @@ cargo run --quiet --release -p pegainfer-server --features kimi-k2 --bin bench_s
PPLX path (TP8/DP1):
```bash
-cargo run --quiet --release -p pegainfer-server --features kimi-k2 --bin bench_serving -- \
+cargo run --quiet --release -p openinfer-server --features kimi-k2 --bin bench_serving -- \
--model-path $MODEL_DIR \
--tp-size 8 --dp-size 1 --ep-backend pplx \
--cuda-graph false \
diff --git a/docs/models/kimi-k2/pplx-ep-decode.md b/docs/models/kimi-k2/pplx-ep-decode.md
index 1ce70ac5..f28ef93c 100644
--- a/docs/models/kimi-k2/pplx-ep-decode.md
+++ b/docs/models/kimi-k2/pplx-ep-decode.md
@@ -122,7 +122,7 @@ block_size=64 means `thread_m_blocks=4` (large-batch config), each m_b
### #0 PPLX EP baseline (2026-05-23)
-Forked from the TP8+EP8 NCCL path and wired into `pegainfer-comm::EpBackend`. The PPLX 4-step protocol replaces the NCCL RS bridge, and the router scale is applied separately after combine_recv (`accumulate=false` + `kimi_scaled_add_f32_bf16_to_bf16`).
+Forked from the TP8+EP8 NCCL path and wired into `openinfer-comm::EpBackend`. The PPLX 4-step protocol replaces the NCCL RS bridge, and the router scale is applied separately after combine_recv (`accumulate=false` + `kimi_scaled_add_f32_bf16_to_bf16`).
The initial bench_serving run measured PPLX TPOT ≈ 37ms versus NCCL no-graph ≈ 19ms.
@@ -183,11 +183,11 @@ the prefix sum is serial, but it is only 48 iterations in shared memory, so it is not worth
| File | Change |
| --- | --- |
-| `pegainfer-kimi-k2/src/runner/moe_pplx.rs` | PPLX_EXPERT_PADDING 64→8, block_size hard-coded to 8, forward logic |
-| `pegainfer-kernels/csrc/kimi_k2/kimi_experts.cu` | routing kernel parallelized to <<<1,64>>>, shared-memory prefix sum |
-| `pegainfer-kernels/csrc/kimi_k2/kimi_marlin_wna16.cu` | `swiglu_w13_pplx_kernel` + C wrapper |
-| `pegainfer-kernels/src/ops/kimi_k2/experts.rs` | `kimi_pplx_build_marlin_routing_on_stream`, tight_max computation, PPLX GEMM wrappers |
-| `pegainfer-kernels/src/ffi.rs` | FFI declarations |
+| `openinfer-kimi-k2/src/runner/moe_pplx.rs` | PPLX_EXPERT_PADDING 64→8, block_size hard-coded to 8, forward logic |
+| `openinfer-kernels/csrc/kimi_k2/kimi_experts.cu` | routing kernel parallelized to <<<1,64>>>, shared-memory prefix sum |
+| `openinfer-kernels/csrc/kimi_k2/kimi_marlin_wna16.cu` | `swiglu_w13_pplx_kernel` + C wrapper |
+| `openinfer-kernels/src/ops/kimi_k2/experts.rs` | `kimi_pplx_build_marlin_routing_on_stream`, tight_max computation, PPLX GEMM wrappers |
+| `openinfer-kernels/src/ffi.rs` | FFI declarations |
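To make the effect of the padding change in the table concrete, here is a CPU-side sketch of per-expert padded offsets built with a serial prefix sum. It assumes the routing step rounds each local expert's token count up to the padding and lays the segments out back to back; `padded_expert_offsets` is a hypothetical helper, not the CUDA routing kernel.

```rust
/// Round each expert's token count up to a multiple of `padding` and return
/// exclusive prefix-sum offsets (length = experts + 1).
fn padded_expert_offsets(counts: &[usize], padding: usize) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(counts.len() + 1);
    let mut cursor = 0;
    offsets.push(cursor);
    for &count in counts {
        // An expert with no tokens contributes an empty segment.
        cursor += count.div_ceil(padding) * padding;
        offsets.push(cursor);
    }
    offsets
}
```

For token counts `[3, 0, 9]`, padding 64 reserves 128 rows while padding 8 reserves 24, which is the kind of reduction a 64→8 padding change would give under this layout assumption.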
## Open
diff --git a/docs/models/kimi-k2/roadmap.md b/docs/models/kimi-k2/roadmap.md
index 531dd91f..9bd88f94 100644
--- a/docs/models/kimi-k2/roadmap.md
+++ b/docs/models/kimi-k2/roadmap.md
@@ -52,7 +52,7 @@ So routing diversity is worth **~7–15%** of decode TPOT, non-linear in K (flat
| CUDA graph decode | ✓ DeepEP full-bucket capture, −12% TPOT | #227/#298 |
| Bench-regression snapshot | ✓ `bench_snapshots/h200/kimi-k2.6.json` | #232 |
| Lint gate (kernels + comm) | ✓ scoped `-D warnings` hook | #233 |
-| LoRA | N/A — server rejects cleanly | `pegainfer-server/src/main.rs` |
+| LoRA | N/A — server rejects cleanly | `openinfer-server/src/main.rs` |
## Claim boundaries
diff --git a/docs/models/kimi-k2/sampling.md b/docs/models/kimi-k2/sampling.md
index cc958b32..14877353 100644
--- a/docs/models/kimi-k2/sampling.md
+++ b/docs/models/kimi-k2/sampling.md
@@ -7,8 +7,8 @@ Last touched: 2026-06
## Param surface (`/v1/completions`)
What a client can send vs. what actually happens. "Frontend" = the vllm-server
-OpenAI layer + `pegainfer-vllm-frontend` conversion
-(`pegainfer-vllm-frontend/src/lib.rs` `convert_sampling`); "engine" = the kimi
+OpenAI layer + `openinfer-vllm-frontend` conversion
+(`openinfer-vllm-frontend/src/lib.rs` `convert_sampling`); "engine" = the kimi
scheduler/worker.
| Param | TP1/DP8 | TP8 | Where decided |
@@ -32,7 +32,7 @@ Rejection UX pitfall: an engine-side rejection surfaces as a generic HTTP 500
vllm-server OpenAI layer swallows the text. Check the server log
(`vllm_engine_core_client::client::stream "request failed"`) when a client
reports a 500. Fixing the mapping is a vllm-rust-workspace change, not a
-pegainfer one.
+openinfer one.
## Design (TP1/DP8)
diff --git a/docs/models/kimi-k2/source-layout.md b/docs/models/kimi-k2/source-layout.md
index 4d4d8b2c..c5966e7c 100644
--- a/docs/models/kimi-k2/source-layout.md
+++ b/docs/models/kimi-k2/source-layout.md
@@ -1,6 +1,6 @@
# Kimi-K2 Source Layout
-> **TL;DR:** Kimi-K2 source files over 1k lines were split by responsibility; the largest Rust file under `pegainfer-kimi-k2/src` is now `layers/attention.rs` at 950 lines.
+> **TL;DR:** Kimi-K2 source files over 1k lines were split by responsibility; the largest Rust file under `openinfer-kimi-k2/src` is now `layers/attention.rs` at 950 lines.
>
> **Last touched:** 2026-05
@@ -9,13 +9,13 @@
- **Read**:
- `docs/index.md` - routed the cleanup to the Kimi-K2 model docs.
- `docs/models/kimi-k2/bringup-history.md` - confirmed `worker.rs` owns decode arena, forward, routing, and sampling paths.
- - `pegainfer-kimi-k2/src/layers/attention.rs` - found tensor-view wrappers and validation helpers mixed into the attention header API.
- - `pegainfer-kimi-k2/src/layers/experts.rs` - found tests embedded at the end of the expert header API.
- - `pegainfer-kimi-k2/src/runner/worker.rs` - found rank worker ownership, state command handling, arena/cache logic, forward kernels, load helpers, runtime helpers, and tests in one file.
+ - `openinfer-kimi-k2/src/layers/attention.rs` - found tensor-view wrappers and validation helpers mixed into the attention header API.
+ - `openinfer-kimi-k2/src/layers/experts.rs` - found tests embedded at the end of the expert header API.
+ - `openinfer-kimi-k2/src/runner/worker.rs` - found rank worker ownership, state command handling, arena/cache logic, forward kernels, load helpers, runtime helpers, and tests in one file.
- **Relevant history**:
- `docs/models/kimi-k2/bringup-history.md` records CUDA Graph and decode arena constraints; splits must preserve pointer-stable decode behavior and not change allocation sites.
- **Plan**:
- 1. List Rust files under `pegainfer-kimi-k2/src` over 1k lines.
+ 1. List Rust files under `openinfer-kimi-k2/src` over 1k lines.
2. Split low-risk header/API files first: attention tensor wrappers/validation helpers and expert tests.
3. Split `runner/worker.rs` by runtime responsibility: state command handling, cache/arena ownership, forward kernels, load helpers, and runtime helpers.
4. Run formatting and Kimi feature compile checks.
@@ -24,35 +24,35 @@
### Step 1: List oversized files
-- Ran `find pegainfer-kimi-k2/src -name '*.rs' -type f -print0 | xargs -0 wc -l`.
+- Ran `find openinfer-kimi-k2/src -name '*.rs' -type f -print0 | xargs -0 wc -l`.
- Files over 1k lines before splitting:
- - `pegainfer-kimi-k2/src/runner/worker.rs` - 2799 lines.
- - `pegainfer-kimi-k2/src/layers/attention.rs` - 1146 lines.
- - `pegainfer-kimi-k2/src/layers/experts.rs` - 1008 lines.
+ - `openinfer-kimi-k2/src/runner/worker.rs` - 2799 lines.
+ - `openinfer-kimi-k2/src/layers/attention.rs` - 1146 lines.
+ - `openinfer-kimi-k2/src/layers/experts.rs` - 1008 lines.
### Step 2: Split header/API modules
-- Moved attention tensor view wrappers to `pegainfer-kimi-k2/src/layers/attention/tensors.rs`.
-- Moved attention validation helpers to `pegainfer-kimi-k2/src/layers/attention/validation.rs`.
-- Moved expert tests to `pegainfer-kimi-k2/src/layers/experts/tests.rs`.
+- Moved attention tensor view wrappers to `openinfer-kimi-k2/src/layers/attention/tensors.rs`.
+- Moved attention validation helpers to `openinfer-kimi-k2/src/layers/attention/validation.rs`.
+- Moved expert tests to `openinfer-kimi-k2/src/layers/experts/tests.rs`.
### Step 3: Split rank worker
-- Moved `KimiRankThreadState` command handling to `pegainfer-kimi-k2/src/runner/worker/state.rs`.
-- Moved decode cache/arena/scratch impls to `pegainfer-kimi-k2/src/runner/worker/cache.rs`.
-- Moved forward kernel paths to `pegainfer-kimi-k2/src/runner/worker/forward.rs`.
-- Moved weight-cache loading and shape checks to `pegainfer-kimi-k2/src/runner/worker/load.rs`.
-- Moved collectives, RoPE helpers, sampling helpers, and decode scalar helpers to `pegainfer-kimi-k2/src/runner/worker/runtime.rs`.
+- Moved `KimiRankThreadState` command handling to `openinfer-kimi-k2/src/runner/worker/state.rs`.
+- Moved decode cache/arena/scratch impls to `openinfer-kimi-k2/src/runner/worker/cache.rs`.
+- Moved forward kernel paths to `openinfer-kimi-k2/src/runner/worker/forward.rs`.
+- Moved weight-cache loading and shape checks to `openinfer-kimi-k2/src/runner/worker/load.rs`.
+- Moved collectives, RoPE helpers, sampling helpers, and decode scalar helpers to `openinfer-kimi-k2/src/runner/worker/runtime.rs`.
### Step 4: Verify
- `cargo fmt --all --check` passed.
-- `PEGAINFER_CUDA_SM=90a PEGAINFER_TRITON_PYTHON=$LOCAL_PEGAINFER_DIR/.venv/bin/python3 cargo check --release -p pegainfer-kimi-k2 --features kimi-k2 --tests` passed.
-- `PEGAINFER_CUDA_SM=90a PEGAINFER_TRITON_PYTHON=$LOCAL_PEGAINFER_DIR/.venv/bin/python3 cargo check --release -p pegainfer-kimi-k2 --lib` passed after gating Kimi runtime/weights exports behind the crate `kimi-k2` feature.
+- `OPENINFER_CUDA_SM=90a OPENINFER_TRITON_PYTHON=$LOCAL_OPENINFER_DIR/.venv/bin/python3 cargo check --release -p openinfer-kimi-k2 --features kimi-k2 --tests` passed.
+- `OPENINFER_CUDA_SM=90a OPENINFER_TRITON_PYTHON=$LOCAL_OPENINFER_DIR/.venv/bin/python3 cargo check --release -p openinfer-kimi-k2 --lib` passed after gating Kimi runtime/weights exports behind the crate `kimi-k2` feature.
## Debrief
-- **Outcome**: All Rust files under `pegainfer-kimi-k2/src` are now below 1k lines; the worker split preserved the Kimi feature compile gate and the default config/tokenizer build.
+- **Outcome**: All Rust files under `openinfer-kimi-k2/src` are now below 1k lines; the worker split preserved the Kimi feature compile gate and the default config/tokenizer build.
- **Pitfalls encountered**:
- Rust module visibility needed explicit promotion for methods moved under `runner/worker/*`.
- The default feature check exposed that Kimi runtime/weights exports were visible without the `kimi-k2` kernel feature.
diff --git a/docs/models/kimi-k2/tp1-dp8-ep8-performance.md b/docs/models/kimi-k2/tp1-dp8-ep8-performance.md
index 77f9b083..42803341 100644
--- a/docs/models/kimi-k2/tp1-dp8-ep8-performance.md
+++ b/docs/models/kimi-k2/tp1-dp8-ep8-performance.md
@@ -1,6 +1,6 @@
# Kimi-K2 TP1 DP8 EP8 performance
-> TL;DR: This ledger tracks pegainfer TP1+DP8+EP8 on 8x H20 against the vLLM TP1+DP8+EP8 bs64 target. The vLLM sustained bs64 `~106ms` TPOT is now explained by a DPLB/CUDA-graph bucket cliff: an uneven DP distribution such as `9,8,8,8,8,8,8,7` pads every rank from graph bucket 8 to 16 and doubles TPOT. O2 landed five production decode-kernel picks (cuBLASLt fixed-shape shared_gate_up / o_proj / MLA strided-batch, split-vocab argmax, fused router selector); accuracy held at the bf16 ULP floor by a base-vs-opt prefill logits A/B, and the PPLX Marlin small-N tile was identified as the messy branch's real accuracy break (`-inf` logits + SIGSEGV at small per-rank N) and rejected. bs64 TPOT is unchanged within noise (p50 `40.58 -> 40.09ms`): the per-kernel wins do not resolve above the ±1ms band at this shape. Every pegainfer optimization must start from a profile, state the expected gain, show a microbench or isolated measurement, then pass correctness and service-level gates before commit.
+> TL;DR: This ledger tracks openinfer TP1+DP8+EP8 on 8x H20 against the vLLM TP1+DP8+EP8 bs64 target. The vLLM sustained bs64 `~106ms` TPOT is now explained by a DPLB/CUDA-graph bucket cliff: an uneven DP distribution such as `9,8,8,8,8,8,8,7` pads every rank from graph bucket 8 to 16 and doubles TPOT. O2 landed five production decode-kernel picks (cuBLASLt fixed-shape shared_gate_up / o_proj / MLA strided-batch, split-vocab argmax, fused router selector); accuracy held at the bf16 ULP floor by a base-vs-opt prefill logits A/B, and the PPLX Marlin small-N tile was identified as the messy branch's real accuracy break (`-inf` logits + SIGSEGV at small per-rank N) and rejected. bs64 TPOT is unchanged within noise (p50 `40.58 -> 40.09ms`): the per-kernel wins do not resolve above the ±1ms band at this shape. Every openinfer optimization must start from a profile, state the expected gain, show a microbench or isolated measurement, then pass correctness and service-level gates before commit.
>
> Last touched: 2026-06-07
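
The bucket cliff described in the TL;DR can be illustrated with a small sketch. It assumes a power-of-two bucket ladder (the ledger only names buckets 8 and 16) and that every DP rank replays the bucket required by the busiest rank; `GRAPH_BUCKETS`, `graph_bucket`, and `padded_bucket` are hypothetical names for illustration, not identifiers from vLLM or this repository.

```python
from bisect import bisect_left

# Assumed bucket ladder; the ledger only mentions buckets 8 and 16.
GRAPH_BUCKETS = [1, 2, 4, 8, 16, 32]


def graph_bucket(rows: int, buckets=GRAPH_BUCKETS) -> int:
    """Return the smallest bucket that can hold `rows` active rows."""
    idx = bisect_left(buckets, rows)
    if idx == len(buckets):
        raise ValueError(f"{rows} rows exceed the largest bucket {buckets[-1]}")
    return buckets[idx]


def padded_bucket(per_rank_rows, buckets=GRAPH_BUCKETS) -> int:
    """Assumption: all DP ranks pad to the bucket of the busiest rank."""
    return graph_bucket(max(per_rank_rows), buckets)


even = [8] * 8                      # 64 requests, perfectly balanced
uneven = [9, 8, 8, 8, 8, 8, 8, 7]   # 64 requests, one rank over by one

print(padded_bucket(even))    # 8
print(padded_bucket(uneven))  # 16
```

Both distributions carry 64 requests, but a single extra row on one rank moves every rank to the next bucket, which matches the doubling the ledger attributes to the uneven case.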
@@ -44,42 +44,42 @@ For TP1 DP8, correctness checks must include uneven per-rank active rows and emp
Path placeholders:
```bash
-export PEGAINFER_DIR=/path/to/pegainfer
+export OPENINFER_DIR=/path/to/openinfer
export VLLM_DIR=/path/to/vllm_test
export MODEL_DIR=/path/to/Kimi-K2.5
export NCCL_LIB_DIR=/path/to/nccl-lib
export EVAL_VENV=/path/to/eval-venv
export RESULT_ROOT=/path/to/result-root
-export TRITON_PYTHON=$PEGAINFER_DIR/.triton-venv/bin/python
+export TRITON_PYTHON=$OPENINFER_DIR/.triton-venv/bin/python
```
Build on an :
```bash
-cd "$PEGAINFER_DIR"
+cd "$OPENINFER_DIR"
CUDA_HOME=/usr/local/cuda \
NVCC=/usr/local/cuda/bin/nvcc \
LD_LIBRARY_PATH="$NCCL_LIB_DIR:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}" \
-PEGAINFER_CUDA_SM=90a \
-PEGAINFER_TRITON_PYTHON="$TRITON_PYTHON" \
-cargo build --release -p pegainfer-server \
- --features kimi-k2 --bin pegainfer --bin bench_serving
+OPENINFER_CUDA_SM=90a \
+OPENINFER_TRITON_PYTHON="$TRITON_PYTHON" \
+cargo build --release -p openinfer-server \
+ --features kimi-k2 --bin openinfer --bin bench_serving
```
-(The old `kimi-k2-pplx-ep` feature and `PEGAINFER_KIMI_PARALLEL` env existed only on the
+(The old `kimi-k2-pplx-ep` feature and `OPENINFER_KIMI_PARALLEL` env existed only on the
pre-merge branch; on main the feature is `kimi-k2` and parallel shape is selected by the
`--tp-size/--dp-size/--ep-backend` CLI flags below. nvcc must also be on `PATH` — the
-`pegainfer-comm` cc-rs build looks it up there, not via `$NVCC`.)
+`openinfer-comm` cc-rs build looks it up there, not via `$NVCC`.)
In-process bs64:
```bash
-cd "$PEGAINFER_DIR"
+cd "$OPENINFER_DIR"
CUDA_HOME=/usr/local/cuda \
NVCC=/usr/local/cuda/bin/nvcc \
LD_LIBRARY_PATH="$NCCL_LIB_DIR:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}" \
-PEGAINFER_CUDA_SM=90a \
-PEGAINFER_TRITON_PYTHON="$TRITON_PYTHON" \
+OPENINFER_CUDA_SM=90a \
+OPENINFER_TRITON_PYTHON="$TRITON_PYTHON" \
target/release/bench_serving \
--model-path "$MODEL_DIR" \
--cuda-graph false \
@@ -92,13 +92,13 @@ target/release/bench_serving \
Service bs64, same client shape as vLLM:
```bash
-cd "$PEGAINFER_DIR"
+cd "$OPENINFER_DIR"
CUDA_HOME=/usr/local/cuda \
NVCC=/usr/local/cuda/bin/nvcc \
LD_LIBRARY_PATH="$NCCL_LIB_DIR:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}" \
-PEGAINFER_CUDA_SM=90a \
-PEGAINFER_TRITON_PYTHON="$TRITON_PYTHON" \
-target/release/pegainfer --model-path "$MODEL_DIR" --served-model-name kimi-k2.5 \
+OPENINFER_CUDA_SM=90a \
+OPENINFER_TRITON_PYTHON="$TRITON_PYTHON" \
+target/release/openinfer --model-path "$MODEL_DIR" --served-model-name kimi-k2.5 \
--port 8124 --cuda-graph false --tp-size 1 --dp-size 8 --ep-backend pplx
```
@@ -125,13 +125,13 @@ vllm bench serve \
--save-result \
--save-detailed \
--result-dir "$RESULT_ROOT/kimi-tp1dp8-service" \
- --result-filename pegainfer_tp1dp8_bs64_${COMMIT}.json
+ --result-filename openinfer_tp1dp8_bs64_${COMMIT}.json
```
GSM8K accuracy smoke, concurrent OpenAI `/v1/completions` path:
```bash
-cd "$PEGAINFER_DIR"
+cd "$OPENINFER_DIR"
source "$EVAL_VENV/bin/activate"
lm_eval run --model local-completions \
--model_args "model=kimi-k2.5,base_url=http://127.0.0.1:8125/v1/completions,tokenizer_backend=huggingface,tokenizer=$MODEL_DIR,tokenized_requests=False,trust_remote_code=True,max_length=4096,max_gen_toks=256,num_concurrent=16,timeout=300" \
@@ -190,13 +190,13 @@ vllm bench serve \
nsys profile:
```bash
-cd "$PEGAINFER_DIR"
+cd "$OPENINFER_DIR"
mkdir -p "$RESULT_ROOT/kimi-profile"
CUDA_HOME=/usr/local/cuda \
NVCC=/usr/local/cuda/bin/nvcc \
LD_LIBRARY_PATH="$NCCL_LIB_DIR:/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}" \
-PEGAINFER_CUDA_SM=90a \
-PEGAINFER_TRITON_PYTHON="$TRITON_PYTHON" \
+OPENINFER_CUDA_SM=90a \
+OPENINFER_TRITON_PYTHON="$TRITON_PYTHON" \
nsys profile --force-overwrite=true --trace=cuda,nvtx \
--cuda-graph-trace=node --export=sqlite \
-o "$RESULT_ROOT/kimi-profile/tp1dp8_bs64_o128_${COMMIT}" \
@@ -235,7 +235,7 @@ Motivation and expected gain:
Change:
-- `pegainfer-kimi-k2/src/runner/engine.rs`
+- `openinfer-kimi-k2/src/runner/engine.rs`
- `MAX_BATCH_PER_DP: 4 -> 8`.
- Added prompt_len1 admission batching in `DpCoordinator`.
- For prompt_len1 requests, send `StepCommand::Decode { positions: vec![0], slots, decode_batch_size: MAX_BATCH_PER_DP }` instead of `Prefill`.
@@ -293,12 +293,12 @@ Correctness:
CUDA_HOME=/usr/local/cuda \
NVCC=/usr/local/cuda/bin/nvcc \
LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-} \
-cargo test -r -p pegainfer-kimi-k2 --features pplx-ep runner::engine::tests --no-fail-fast
+cargo test -r -p openinfer-kimi-k2 --features pplx-ep runner::engine::tests --no-fail-fast
```
- Local result: `5 passed`.
- H20 result at `0c23389`: `5 passed`.
-- Mixed-arrival service test, `$RESULT_ROOT/kimi-tp1dp8-service/pegainfer_tp1dp8_mixed_arrival_prompt1_o64_0c23389.json`:
+- Mixed-arrival service test, `$RESULT_ROOT/kimi-tp1dp8-service/openinfer_tp1dp8_mixed_arrival_prompt1_o64_0c23389.json`:
`64/64` success with `--request-rate 16`, peak concurrent requests `54`, TTFT p50/p99
`58.10/110.88ms`, TPOT p50/p99 `35.91/37.63ms`. This covers prompt_len1
admissions landing while existing decode slots are active.
@@ -309,7 +309,7 @@ Performance:
`64/64` success, TTFT p50/p99 `74.62/77.19ms`, first decode p50/p99
`38.23/38.24ms`, steady TPOT p50/p95/p99 `40.10/43.32/43.72ms`.
- Service, same `vllm bench serve` client as vLLM,
- `$RESULT_ROOT/kimi-tp1dp8-service/pegainfer_tp1dp8_bs64_o128_0c23389_after_warmup.json`:
+ `$RESULT_ROOT/kimi-tp1dp8-service/openinfer_tp1dp8_bs64_o128_0c23389_after_warmup.json`:
`256/256` success, output `1336.35 tok/s`, TTFT p50/p99 `105.31/127.81ms`,
TPOT p50/p95/p99 `47.34/47.70/47.71ms`, ITL p50/p99 `47.84/50.69ms`.
- vLLM warmup-after baseline,
@@ -365,7 +365,7 @@ Decision:
- Keep as the current H20 bs64 performance baseline. O1 moves prompt_len=1 onto the decode
shape and clears the vLLM bs64 TPOT/output gate; full token-parity correctness remains a
separate reference gate before using TP1 DP8 as an accuracy baseline. Follow-up profiles should
- focus on lowering pegainfer service TPOT from `47ms` toward the H200-reported 30ms-class
+ focus on lowering openinfer service TPOT from `47ms` toward the H200-reported 30ms-class
expectation if that target is confirmed on comparable hardware.
### O2 - decode kernel cherry-pick: cuBLASLt fixed-shape GEMMs, argmax split, router fusion
@@ -410,7 +410,7 @@ Rejected: PPLX Marlin small-N tile (messy-branch `dd69876`) — the accuracy bre
Accuracy gate: base-vs-opt prefill logits A/B. GSM8K-class evals are too coarse for
ULP-level kernel drift, so the gate follows `subsystems/correctness/logits-golden-gate.md`
-with base-pegainfer itself as the reference at the same TP1 DP8 PPLX config: a throwaway
+with base-openinfer itself as the reference at the same TP1 DP8 PPLX config: a throwaway
(uncommitted) hook after the prefill lm_head GEMM in `runner/worker/state.rs` dumps
full-vocab bf16 logits at every prompt position for 12 fixed raw prompts (en/zh/code/math,
1..90 tokens) sent through `/v1/completions` at `max_tokens=1`, identical patch on base
diff --git a/docs/models/kimi-k2/vllm-h20-baseline.md b/docs/models/kimi-k2/vllm-h20-baseline.md
index 5890b772..0afc54b8 100644
--- a/docs/models/kimi-k2/vllm-h20-baseline.md
+++ b/docs/models/kimi-k2/vllm-h20-baseline.md
@@ -1,6 +1,6 @@
# Kimi-K2 vLLM H20 Baseline (decode-heavy)
-> **TL;DR:** vLLM `0.19.0` + Kimi-K2.5 on 8× H20 with TP1+DP8+EP8 (NCCL allgather/reducescatter all2all), running the `bench serve` decode-heavy profile (input=1, output=128, ignore-eos), swept over bs=1..256. This is the vLLM-side baseline data snapshot and serves as the hard upper bound for the pegainfer TP1+DP8+EP8 active line (for current pegainfer data see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md)). The pegainfer column below is a **historical TP8+EP8 bs=4 bring-up comparison** (TPOT med `19.13ms` vs vLLM `24.97ms`; HTTP is 33% higher than the in-process `14.39ms`), kept as an early record of frontend/streaming overhead.
+> **TL;DR:** vLLM `0.19.0` + Kimi-K2.5 on 8× H20 with TP1+DP8+EP8 (NCCL allgather/reducescatter all2all), running the `bench serve` decode-heavy profile (input=1, output=128, ignore-eos), swept over bs=1..256. This is the vLLM-side baseline data snapshot and serves as the hard upper bound for the openinfer TP1+DP8+EP8 active line (for current openinfer data see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md)). The openinfer column below is a **historical TP8+EP8 bs=4 bring-up comparison** (TPOT med `19.13ms` vs vLLM `24.97ms`; HTTP is 33% higher than the in-process `14.39ms`), kept as an early record of frontend/streaming overhead.
>
> **Last touched:** 2026-06
@@ -42,7 +42,7 @@ cross-hardware/backend check.
| Model | Kimi-K2.5 (local `$MODEL_DIR`, INT4 + BF16 scale Marlin WNA16) |
| vLLM | `0.19.0` (venv `$VLLM_DIR/.venv`) |
| Sharding | **vLLM**: TP=1, DP=8, EP=8, all2all backend `allgather_reducescatter` (NCCL, default) |
-| Sharding | **pegainfer**: TP=8, EP=8, NCCL F32 hidden bridge + RS routed bridge |
+| Sharding | **openinfer**: TP=8, EP=8, NCCL F32 hidden bridge + RS routed bridge |
| Profile | input_len=1, output_len=128, `--ignore-eos`, `--random-range-ratio 0` |
| Bench | `vllm bench serve --backend openai --endpoint /v1/completions` (same client, aligned on both sides) |
| Data | `$VLLM_DIR/kimi_dp8_baseline/result_*.json` |
@@ -80,11 +80,11 @@ cross-hardware/backend check.
**bs=8 ≈ the inflection point**: beyond it, packing in more batch quickly degrades the per-request experience, and the aggregate throughput gain is gradually cancelled out by the 8× batch / 8× latency trade. Under the decode performance metric, the vLLM TP1+DP8+EP8 "sweet spot" is bs=8 (each of the 8 DP ranks at bs=1).
-## pegainfer bs=4 comparison point
+## openinfer bs=4 comparison point
-The table below is the historical TP8+EP8 bring-up comparison (at the time `KIMI_RUNNER_MAX_BATCH=4`, with no bs sweep; that const is now `64`, bucketed). The same client and profile (`vllm bench serve`, input=1, output=128, ignore-eos, max_concurrency=4) were run against the pegainfer OpenAI-compatible server:
+The table below is the historical TP8+EP8 bring-up comparison (at the time `KIMI_RUNNER_MAX_BATCH=4`, with no bs sweep; that const is now `64`, bucketed). The same client and profile (`vllm bench serve`, input=1, output=128, ignore-eos, max_concurrency=4) were run against the openinfer OpenAI-compatible server:
-| Metric | pegainfer TP8+EP8 (HTTP, vllm bench) | vLLM TP1+DP8+EP8 bs=4 | pegainfer in-process bench, bs4 |
+| Metric | openinfer TP8+EP8 (HTTP, vllm bench) | vLLM TP1+DP8+EP8 bs=4 | openinfer in-process bench, bs4 |
| --- | ---: | ---: | ---: |
| TPOT median | `19.13ms` | `24.97ms` | `14.39ms` |
| TPOT p99 | `23.63ms` | `29.46ms` | `14.83ms` |
@@ -94,19 +94,19 @@ cross-hardware/backend check.
| Output tok/s | `159.99` | `157.94` | `≈278` |
Data sources:
-- pegainfer HTTP: `result_pegainfer_bs4.json`; the server is `target/release/pegainfer --model-path $MODEL_DIR --port 8124 --cuda-graph true`, and the client is the same bench command used for vLLM.
+- openinfer HTTP: `result_openinfer_bs4.json`; the server is `target/release/openinfer --model-path $MODEL_DIR --port 8124 --cuda-graph true`, and the client is the same bench command used for vLLM.
- vLLM bs=4: the bs=4 row of the sweep table above.
-- pegainfer in-process: `bench_serving request --cuda-graph true --concurrency 4`; see optimization.md.
+- openinfer in-process: `bench_serving request --cuda-graph true --concurrency 4`; see optimization.md.
### Conclusions
-1. **On the same hardware, client, and profile, pegainfer TPOT is 23% lower than vLLM** (`19.13 vs 24.97`). This is expected: pegainfer runs TP=8, splitting the per-token MLA / dense / shared expert GEMMs across 8 ranks with a cross-rank reduce on every emitted token; with TP=1, each vLLM rank runs the full GEMMs itself, gaining throughput from DP=8 at the cost of slower individual requests. On the decode-latency mainline, TP8 still wins.
+1. **On the same hardware, client, and profile, openinfer TPOT is 23% lower than vLLM** (`19.13 vs 24.97`). This is expected: openinfer runs TP=8, splitting the per-token MLA / dense / shared expert GEMMs across 8 ranks with a cross-rank reduce on every emitted token; with TP=1, each vLLM rank runs the full GEMMs itself, gaining throughput from DP=8 at the cost of slower individual requests. On the decode-latency mainline, TP8 still wins.
-2. **On TTFT, vLLM wins outright**: median `69.60ms` vs pegainfer `313.10ms`. pegainfer p99 spikes to `4239.97ms`, which is essentially first-request cold start (first NCCL collective stream drain + scheduler warmup). The debt that the decode-first design owes on the prefill path surfaces all at once in the p99.
+2. **On TTFT, vLLM wins outright**: median `69.60ms` vs openinfer `313.10ms`. openinfer p99 spikes to `4239.97ms`, which is essentially first-request cold start (first NCCL collective stream drain + scheduler warmup). The debt that the decode-first design owes on the prefill path surfaces all at once in the p99.
-3. **HTTP overhead is abnormally high**: at the same bs=4, pegainfer's HTTP-measured TPOT med is `19.13ms` while the in-process bench is `14.39ms`, i.e. 4.74ms / token, ~33% overhead. Streaming JSON plus the frontend bridge should not cost this much. **This is called out separately as a finding to investigate later**, with priority between the decode kernel and prefill work.
+3. **HTTP overhead is abnormally high**: at the same bs=4, openinfer's HTTP-measured TPOT med is `19.13ms` while the in-process bench is `14.39ms`, i.e. 4.74ms / token, ~33% overhead. Streaming JSON plus the frontend bridge should not cost this much. **This is called out separately as a finding to investigate later**, with priority between the decode kernel and prefill work.
-4. **Aggregate throughput is not a fair comparison (historical)**: at the time pegainfer was capped at `KIMI_RUNNER_MAX_BATCH=4` (now `64`, bucketed) and could not sweep bs, while vLLM TP1+DP8 reached `1131 tok/s` at bs=256. This data point set the upper bound for the TP1+DP8+EP8 milestone at the time: on H20 ×8 with the same client methodology, **the vLLM TP1+DP8+EP8 baseline is single-request TPOT `17.94ms` (bs=1) / aggregate `1131 tok/s` (bs=256)**. pegainfer's TP1+DP8+EP8 has since landed, with bs64 service output `1336 tok/s` / TPOT p50 `47.3ms` (see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md)).
+4. **Aggregate throughput is not a fair comparison (historical)**: at the time openinfer was capped at `KIMI_RUNNER_MAX_BATCH=4` (now `64`, bucketed) and could not sweep bs, while vLLM TP1+DP8 reached `1131 tok/s` at bs=256. This data point set the upper bound for the TP1+DP8+EP8 milestone at the time: on H20 ×8 with the same client methodology, **the vLLM TP1+DP8+EP8 baseline is single-request TPOT `17.94ms` (bs=1) / aggregate `1131 tok/s` (bs=256)**. openinfer's TP1+DP8+EP8 has since landed, with bs64 service output `1336 tok/s` / TPOT p50 `47.3ms` (see [tp1-dp8-ep8-performance.md](tp1-dp8-ep8-performance.md)).
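
The percentages quoted in the conclusions follow directly from the TPOT medians in the table. A minimal worked check, using only the three numbers from the table (the helper name `relative_overhead` is hypothetical):

```python
def relative_overhead(measured_ms: float, baseline_ms: float) -> float:
    """Fractional overhead of `measured_ms` relative to `baseline_ms`."""
    return (measured_ms - baseline_ms) / baseline_ms


# TPOT medians from the bs=4 comparison table.
http_tpot_ms = 19.13      # HTTP path, vllm bench client
inproc_tpot_ms = 14.39    # in-process bench, bs4
vllm_tpot_ms = 24.97      # vLLM TP1+DP8+EP8 bs=4

delta_ms = http_tpot_ms - inproc_tpot_ms
http_overhead = relative_overhead(http_tpot_ms, inproc_tpot_ms)
tpot_reduction = (vllm_tpot_ms - http_tpot_ms) / vllm_tpot_ms

print(f"HTTP adds {delta_ms:.2f} ms/token ({http_overhead:.0%} over in-process)")
print(f"HTTP TPOT is {tpot_reduction:.0%} lower than vLLM")
```

Note that the 33% figure is relative to the in-process baseline, while the 23% figure is relative to the vLLM TPOT.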
## Reproduction commands
@@ -121,16 +121,16 @@ vllm serve $MODEL_DIR \
--port 8123 --max-num-seqs 256 --max-model-len 4096
```
-pegainfer server (8×H20 node). Build with `cargo build --release -p pegainfer-server --features kimi-k2 --bin pegainfer`; the parallel shape is selected by CLI flags (the current active line is TP1+DP8+EP8 PPLX, and the flags below match the vLLM shape for an apples-to-apples comparison):
+openinfer server (8×H20 node). Build with `cargo build --release -p openinfer-server --features kimi-k2 --bin openinfer`; the parallel shape is selected by CLI flags (the current active line is TP1+DP8+EP8 PPLX, and the flags below match the vLLM shape for an apples-to-apples comparison):
```bash
-LD_LIBRARY_PATH=$RESULT_ROOT/pegainfer-nccl-lib:$LD_LIBRARY_PATH \
- $PEGAINFER_DIR/target/release/pegainfer \
+LD_LIBRARY_PATH=$RESULT_ROOT/openinfer-nccl-lib:$LD_LIBRARY_PATH \
+ $OPENINFER_DIR/target/release/openinfer \
--model-path $MODEL_DIR --port 8124 --cuda-graph true \
--tp-size 1 --dp-size 8 --ep-backend pplx
```
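-> Note: the pegainfer data in the table is the historical TP8+EP8 bs=4 measurement (taken with the old `kimi-k2-pplx-ep` feature / `PEGAINFER_KIMI_PARALLEL` env, both since removed). The command above reproduces the active TP1+DP8+EP8 line with the current CLI; it is not the command that produced the TP8 data in the table, and it only runs on 8×H20.
+> Note: the openinfer data in the table is the historical TP8+EP8 bs=4 measurement (taken with the old `kimi-k2-pplx-ep` feature / `OPENINFER_KIMI_PARALLEL` env, both since removed). The command above reproduces the active TP1+DP8+EP8 line with the current CLI; it is not the command that produced the TP8 data in the table, and it only runs on 8×H20.
bench (client side; change `--base-url` to target either server):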
diff --git a/docs/models/kimi-k2/vllm-path-comparison.md b/docs/models/kimi-k2/vllm-path-comparison.md
index 4f5695bb..8b067313 100644
--- a/docs/models/kimi-k2/vllm-path-comparison.md
+++ b/docs/models/kimi-k2/vllm-path-comparison.md
@@ -1,6 +1,6 @@
# Kimi-K2 vLLM Path Comparison
-> **TL;DR:** The largest structural differences between vLLM Kimi/DeepSeekV3 decode and PegaInfer decode have narrowed to MLA cache/metadata and the collective bridge: PegaInfer now likewise merges `q_a + kv_a` through a load-time `fused_qkv_a_proj`, and decode runs `gemm_graphsafe(fused_qkv_a)` followed by `kimi_mla_split_qkv_a` to split out `q_a/compressed_kv/k_rope` in one pass. MoE shared/main vs routed compute/aux stream overlap, the shared gate/up fused GEMM, the dense layer0 gate/up fused GEMM, the routed scale+residual add fused kernel, and the routed sum clear and Marlin locks clear cleanup have all passed the H20 correctness/perf gate; the real-fixture output16 steady TPOT p99 is `14.26ms`, and the synthetic output64 steady TPOT is avg `14.39ms` / p99 `14.83ms`. The vLLM TP-only MoE final all-reduce cadence was measured in both BF16 and F32 variants and both were slower than the current RS bridge, so the RS bridge is kept.
+> **TL;DR:** The largest structural differences between vLLM Kimi/DeepSeekV3 decode and OpenInfer decode have narrowed to MLA cache/metadata and the collective bridge: OpenInfer now likewise merges `q_a + kv_a` through a load-time `fused_qkv_a_proj`, and decode runs `gemm_graphsafe(fused_qkv_a)` followed by `kimi_mla_split_qkv_a` to split out `q_a/compressed_kv/k_rope` in one pass. MoE shared/main vs routed compute/aux stream overlap, the shared gate/up fused GEMM, the dense layer0 gate/up fused GEMM, the routed scale+residual add fused kernel, and the routed sum clear and Marlin locks clear cleanup have all passed the H20 correctness/perf gate; the real-fixture output16 steady TPOT p99 is `14.26ms`, and the synthetic output64 steady TPOT is avg `14.39ms` / p99 `14.83ms`. The vLLM TP-only MoE final all-reduce cadence was measured in both BF16 and F32 variants and both were slower than the current RS bridge, so the RS bridge is kept.
>
> **Last touched:** 2026-05
@@ -18,12 +18,12 @@
- `$VLLM_DIR_ALT/vllm/model_executor/layers/fused_moe/layer.py`
- `$VLLM_DIR_ALT/csrc/cache_kernels.cu`
- `$VLLM_DIR_ALT/csrc/moe/*`
-- PegaInfer files:
- - `pegainfer-kimi-k2/src/direct/worker.rs`
- - `pegainfer-kimi-k2/src/batch_decode_trace.rs`
- - `pegainfer-kernels/src/ops/kimi_mla.rs`
- - `pegainfer-kernels/src/ops/kimi_router.rs`
- - `pegainfer-kernels/src/ops/kimi_experts.rs`
+- OpenInfer files:
+ - `openinfer-kimi-k2/src/direct/worker.rs`
+ - `openinfer-kimi-k2/src/batch_decode_trace.rs`
+ - `openinfer-kernels/src/ops/kimi_mla.rs`
+ - `openinfer-kernels/src/ops/kimi_router.rs`
+ - `openinfer-kernels/src/ops/kimi_experts.rs`
## vLLM Decode Operator List
@@ -52,20 +52,20 @@ This is the source-level list for Kimi/DeepSeekV3 decode, not an nsys trace. PyT
| MoE scale and TP reduce | For BF16, routed output is multiplied by `routed_scaling_factor`, added with shared output, then `maybe_all_reduce_tensor_model_parallel`. | `deepseek_v2.py:187-208` |
| Final logits | Final RMSNorm then LM head; sampling/logprobs live in vLLM sampling path rather than model file. | `deepseek_v2.py:724-725` |
-## PegaInfer Current Decode Operator List
+## OpenInfer Current Decode Operator List
This list follows the current worker implementation. The static trace is now source-aligned for these high-level operators after the MLA trace fix below.
-| Section | PegaInfer actual operator path | Source evidence |
+| Section | OpenInfer actual operator path | Source evidence |
| --- | --- | --- |
| Embedding | `embedding_batch_vocab_shard` then TP all-reduce through BF16-via-F32 bridge. | `batch_decode_trace.rs:49-63` |
| Attention input | `rms_norm_batch_into(hidden, input_norm)`. | `worker.rs:1777-1783` |
| MLA q/kv down projection | `gemm_graphsafe(fused_qkv_a_proj)` then `kimi_mla_split_qkv_a` produces `q_a`, `compressed_kv`, and `k_rope`; q branch then runs `rms_norm_batch(q_a_norm)` and `gemm_graphsafe(q_b_proj)`, kv branch runs `rms_norm_batch(kv_a_norm)`. | `worker.rs:1784-1827` |
| MLA RoPE split | `kimi_mla_rope_split_decode(q_proj, k_rope, cos, sin, positions)` produces `q_nope`, `q_pe`, and `append_kpe`. | `worker.rs:1839-1849` |
-| MLA q absorb | `kimi_mla_absorb_q_nope(kv_b_proj, q_nope)` uses preloaded `kv_b_proj` weight; this is the PegaInfer equivalent of vLLM `q_nope @ W_UK_T`. | `worker.rs:1850-1855` |
+| MLA q absorb | `kimi_mla_absorb_q_nope(kv_b_proj, q_nope)` uses preloaded `kv_b_proj` weight; this is the OpenInfer equivalent of vLLM `q_nope @ W_UK_T`. | `worker.rs:1850-1855` |
| MLA cache append | `kimi_mla_paged_kv_append(compressed_normed, append_kpe, page tables, positions)` writes worker-owned paged MLA KV. | `worker.rs:1856-1868` |
| MLA attention | `kimi_flashinfer_batch_decode_mla(q_abs_nope, q_pe, ckv_cache, kpe_cache, page tables, request_indices, kv metadata)`. | `worker.rs:1880-1895` |
-| MLA v up | `kimi_mla_v_up(kv_b_proj, latent)`; this is the PegaInfer equivalent of vLLM `_v_up_proj`. | `worker.rs:1907-1912` |
+| MLA v up | `kimi_mla_v_up(kv_b_proj, latent)`; this is the OpenInfer equivalent of vLLM `_v_up_proj`. | `worker.rs:1907-1912` |
| MLA output projection | `gemm_graphsafe(o_proj)` then TP all-reduce through BF16-via-F32 bridge, then residual add. | `worker.rs:1913-1934`, `batch_decode_trace.rs:279-291` |
| Dense layer 0 MLP | post-attn RMSNorm, separate gate/up GEMMs, `silu_mul_batch`, down GEMM, BF16-via-F32 TP all-reduce, residual add. | `batch_decode_trace.rs:294-327` |
| MoE shared expert | post-attn RMSNorm; load-time fused shared gate/up GEMM, `silu_mul_fused_batch_into`, shared down GEMM, BF16-via-F32 TP all-reduce. | `worker.rs:2201-2238` |
@@ -111,7 +111,7 @@ This count is source-aligned for the high-level worker operators. It still folds
## Trace Drift Fixed In This Session
-`pegainfer-kimi-k2/src/batch_decode_trace.rs` differed from `worker.rs` in the first draft of this document:
+`openinfer-kimi-k2/src/batch_decode_trace.rs` differed from `worker.rs` in the first draft of this document:
| Trace item | Current trace | Actual worker path | Effect |
| --- | --- | --- | --- |
@@ -129,20 +129,20 @@ Validation:
```bash
cargo fmt --all --check
-PEGAINFER_CUDA_SM=90a cargo check --release -p pegainfer-kimi-k2 --features kernel-report --bins
-PEGAINFER_CUDA_SM=90a cargo run --release -p pegainfer-kimi-k2 --features kernel-report --bin kimi_kernel_report -- \
+OPENINFER_CUDA_SM=90a cargo check --release -p openinfer-kimi-k2 --features kernel-report --bins
+OPENINFER_CUDA_SM=90a cargo run --release -p openinfer-kimi-k2 --features kernel-report --bin kimi_kernel_report -- \
trace --source static --batch-size 4 --kv-len 1024 --out $RESULT_ROOT/kimi_decode_trace_fixed_bs4_kv1024.json
```
-H20 validation for the fused-qkv patch used the same `cargo check` and static trace command under `$PEGAINFER_DIR` with `PEGAINFER_TRITON_PYTHON=$PEGAINFER_DIR/.venv-kimi/bin/python`; output was `calls=1886`, `gemm_graphsafe=367`, and `kimi_mla_split_qkv_a=61`.
+H20 validation for the fused-qkv patch used the same `cargo check` and static trace command under `$OPENINFER_DIR` with `OPENINFER_TRITON_PYTHON=$OPENINFER_DIR/.venv-kimi/bin/python`; output was `calls=1886`, `gemm_graphsafe=367`, and `kimi_mla_split_qkv_a=61`.
Runtime model-report validation on H20:
```bash
-LD_LIBRARY_PATH=$RESULT_ROOT/pegainfer-nccl-lib:${LD_LIBRARY_PATH:-} \
-PEGAINFER_CUDA_SM=90a \
-PEGAINFER_TRITON_PYTHON=$PEGAINFER_DIR/.venv-kimi/bin/python \
-cargo run --release -p pegainfer-kimi-k2 --features kernel-report --bin kimi_model_report -- \
+LD_LIBRARY_PATH=$RESULT_ROOT/openinfer-nccl-lib:${LD_LIBRARY_PATH:-} \
+OPENINFER_CUDA_SM=90a \
+OPENINFER_TRITON_PYTHON=$OPENINFER_DIR/.venv-kimi/bin/python \
+cargo run --release -p openinfer-kimi-k2 --features kernel-report --bin kimi_model_report -- \
decode --source runtime --batch-size 4 --kv-len 28 --iters 1 --format text \
--out $RESULT_ROOT/kimi_runtime_model_report_bs4_kv28_fixed_trace_v2.json
```
@@ -163,17 +163,17 @@ H20 graph serving gates after fused-qkv:
## Path Differences That Matter
-| Difference | vLLM | PegaInfer | Why it matters |
+| Difference | vLLM | OpenInfer | Why it matters |
| --- | --- | --- | --- |
| MLA first projection | One `MergedReplicatedLinear` for `[q_lora_rank, kv_lora_rank + rope_dim]`. | Now one load-time fused `DeviceMatrix` plus one graph-safe GEMM and one split kernel. | This structural delta is closed in code. The keep/revert gate is H20 correctness plus TPOT/model-report improvement. |
| Dense gate/up | V1 can use fused `gate_up_proj`; V0 module-level path still exposes gate/up. | Dense layer still uses separate gate/up; MoE shared expert now uses load-time fused gate/up GEMM. | A single dense layer matters little; shared expert repeat cost is now closed at the high-level GEMM count. |
| Router GEMM | V1 has small-batch `dsv3_router_gemm` / `router_gemm_bf16_fp32` path before grouped top-k. | `kimi_router_noaux_tc_launch` is a single custom router/top-k kernel path. | Needs a microbench comparison rather than an assumption; router was ~3.7ms/step in the old strong-sync profile. |
-| MLA cache append and metadata | vLLM uses `concat_and_cache_mla`; FlashMLA prepares tile scheduler metadata and graph buffers. | PegaInfer uses `kimi_mla_paged_kv_append` and precomputed decode arena arrays. | Need to compare metadata/cache append cost before changing attention kernels; the trace currently hides this. |
-| MLA q absorb/v up | vLLM uses `torch.bmm` with preprocessed `W_UK_T/W_UV`. | PegaInfer custom kernels `kimi_mla_absorb_q_nope` and `kimi_mla_v_up` over `kv_b_proj`. | Semantically aligned, but microbench should decide whether custom kernels or cuBLAS batched GEMM wins for bs1..4. |
-| MoE WNA16 | Both use Marlin WNA16 route align, W13, SiLU, W2, sum. | PegaInfer has persistent workspace and explicit local EP route metadata. | Main MoE kernel choice is already aligned; next work is route histogram/tail and combine, not replacing WNA16. |
-| Routed combine | vLLM EP path maps local experts via `expert_map`; final tensor-parallel reduce happens through vLLM distributed path. | PegaInfer currently uses NCCL bridge: local sum -> repeat -> reduce-scatter -> fused scale+residual add. | This is not PPLX EP; it is graph-capturable but likely still extra data movement. |
-| TP collectives | vLLM parallel layers hide TP reductions; BF16 path does not visibly use our BF16-via-F32 bridge. | PegaInfer uses BF16-via-F32 bridge for hidden all-reduces because BF16 collective changed greedy output. | This is correctness-driven overhead; replacing it needs external vLLM greedy/top-k gate. |
-| Sampling/top1 | vLLM sampling/logprobs is integrated with its sampler path. | PegaInfer graph body ends at local top1; worker D2H reads local top1 and scheduler CPU-selects across ranks. | This graph-external boundary is real, but prior profile says it is not the largest item; fix after trace/accounting is accurate. |
+| MLA cache append and metadata | vLLM uses `concat_and_cache_mla`; FlashMLA prepares tile scheduler metadata and graph buffers. | OpenInfer uses `kimi_mla_paged_kv_append` and precomputed decode arena arrays. | Need to compare metadata/cache append cost before changing attention kernels; the trace currently hides this. |
+| MLA q absorb/v up | vLLM uses `torch.bmm` with preprocessed `W_UK_T/W_UV`. | OpenInfer custom kernels `kimi_mla_absorb_q_nope` and `kimi_mla_v_up` over `kv_b_proj`. | Semantically aligned, but microbench should decide whether custom kernels or cuBLAS batched GEMM wins for bs1..4. |
+| MoE WNA16 | Both use Marlin WNA16 route align, W13, SiLU, W2, sum. | OpenInfer has persistent workspace and explicit local EP route metadata. | Main MoE kernel choice is already aligned; next work is route histogram/tail and combine, not replacing WNA16. |
+| Routed combine | vLLM EP path maps local experts via `expert_map`; final tensor-parallel reduce happens through vLLM distributed path. | OpenInfer currently uses NCCL bridge: local sum -> repeat -> reduce-scatter -> fused scale+residual add. | This is not PPLX EP; it is graph-capturable but likely still extra data movement. |
+| TP collectives | vLLM parallel layers hide TP reductions; BF16 path does not visibly use our BF16-via-F32 bridge. | OpenInfer uses BF16-via-F32 bridge for hidden all-reduces because BF16 collective changed greedy output. | This is correctness-driven overhead; replacing it needs external vLLM greedy/top-k gate. |
+| Sampling/top1 | vLLM sampling/logprobs is integrated with its sampler path. | OpenInfer graph body ends at local top1; worker D2H reads local top1 and scheduler CPU-selects across ranks. | This graph-external boundary is real, but prior profile says it is not the largest item; fix after trace/accounting is accurate. |
## Routed Bridge Probe
@@ -181,9 +181,9 @@ Historical `kimi_graph_probe --probe routed-bridge-compare` (since retired, see
## TP-Only MoE Cadence Probe
-Hypatia did a source-level comparison against the Kimi/DeepSeekV3 TP-only path in `$LOCAL_VLLM_DIR`: vLLM decode performs BF16 all-reduces `1` time for embedding, `61` times for attention, `1` time for dense layer0, and `60` times for MoE final, for a total of `123` BF16 all-reduces, and the MoE TP-only path does not use reduce-scatter. PegaInfer currently performs the same `123` logical hidden all-reduces, plus an additional `60` routed `repeat+RS` bridges.
+Hypatia did a source-level comparison against the Kimi/DeepSeekV3 TP-only path in `$LOCAL_VLLM_DIR`: vLLM decode performs BF16 all-reduces `1` time for embedding, `61` times for attention, `1` time for dense layer0, and `60` times for MoE final, for a total of `123` BF16 all-reduces, and the MoE TP-only path does not use reduce-scatter. OpenInfer currently performs the same `123` logical hidden all-reduces, plus an additional `60` routed `repeat+RS` bridges.
-After temporarily switching PegaInfer decode MoE to the vLLM TP-only final all-reduce, H20 correctness passed but performance regressed:
+After temporarily switching OpenInfer decode MoE to the vLLM TP-only final all-reduce, H20 correctness passed but performance regressed:
| Variant | output16 steady | output64 steady | Decision |
| --- | --- | --- | --- |
@@ -198,4 +198,4 @@ Conclusion: source-level cadence parity alone is not a keep criterion. The next
1. Profile any remaining p99/max tail under dense/shared gate-up fusion plus routed scaled-add fusion and Marlin locks clear removal: output64 avg/p50/p95/p99 are now around `14.4/14.5/14.9/14.8ms`, with p99 under `15ms` in the latest kept gate.
2. Revisit full shared/EP communication overlap only with a production-shaped NCCL probe; isolated two-comm graph replay wins, but worker two-comm init/capture is not stable enough to ship.
3. Next graph-safe local wins: keep Marlin output clears unless route metadata proves every consumed row is written, add `kimi_mla_paged_kv_append` provider coverage, and design a real AG/RS or PPLX EP combine path that removes the repeat-for-RS bridge.
-4. Keep MoE WNA16 kernel path unchanged until the corrected report shows a measured win candidate; current vLLM/PegaInfer MoE compute path is already structurally close.
+4. Keep MoE WNA16 kernel path unchanged until the corrected report shows a measured win candidate; current vLLM/OpenInfer MoE compute path is already structurally close.
diff --git a/docs/models/qwen3/accuracy-gate.md b/docs/models/qwen3/accuracy-gate.md
index 63afb05f..18b0e332 100644
--- a/docs/models/qwen3/accuracy-gate.md
+++ b/docs/models/qwen3/accuracy-gate.md
@@ -1,6 +1,6 @@
# Qwen3-4B accuracy gate
-**TL;DR**: Qwen3-4B's logits are guarded by `tests/hf_golden_gate.rs` — a tolerance check against a stored HuggingFace bf16 golden, *not* an exact-text or hash baseline. It teacher-forces 48 fixed sequences and asserts pegainfer's logprobs stay at the bf16 noise floor of HF across bs=1 / batched eager / CUDA-graph. Strict guards: a structural **regret** check on the argmax + **mean** delta ≤ 0.06 nat + **p99** delta ≤ 0.20 nat; the absolute max is printed but not asserted (it is coverage-unstable). This is the reference implementation of the pattern in `subsystems/correctness/logits-golden-gate.md` — read that for the *why*; this doc is the Qwen3-4B *specifics*.
+**TL;DR**: Qwen3-4B's logits are guarded by `tests/hf_golden_gate.rs` — a tolerance check against a stored HuggingFace bf16 golden, *not* an exact-text or hash baseline. It teacher-forces 48 fixed sequences and asserts openinfer's logprobs stay at the bf16 noise floor of HF across bs=1 / batched eager / CUDA-graph. Strict guards: a structural **regret** check on the argmax + **mean** delta ≤ 0.06 nat + **p99** delta ≤ 0.20 nat; the absolute max is printed but not asserted (it is coverage-unstable). This is the reference implementation of the pattern in `subsystems/correctness/logits-golden-gate.md` — read that for the *why*; this doc is the Qwen3-4B *specifics*.
Last touched: 2026-05
@@ -16,7 +16,7 @@ The methodology (why HF, why a tolerance not a hash, why teacher-forcing, why re
| Reference top-K | HF bf16 top-64 logprobs per position | dumper |
| Regret tolerance | `MARGIN_TOL` = 0.20 nat | gate |
| Mean / p99 bounds | `MEAN_TOL` = 0.06, `P99_TOL` = 0.20 | gate |
-| Head tokens compared | top `HEAD_K` = 8 of pegainfer's own picks | gate |
+| Head tokens compared | top `HEAD_K` = 8 of openinfer's own picks | gate |
| Graph-bucket straddles | `BUCKET_STRADDLES = [9, 5]` (9→bucket 16 = 7 pad; 5→bucket 8 = 3 pad) | gate, from `batch_decode.rs` buckets |
Prompt lengths reach 256 tokens (up to 16 KV blocks at block_size 16) on purpose: the gate then exercises long-attention / KV-block indexing / high RoPE positions, not just short prompts.
@@ -46,7 +46,7 @@ Verified run, all four passes green in 26s:
| graph (9 padded) | 153 | 0.0337 | 0.0260 | 0.1297 | 0.4374 |
| graph (5 padded) | 85 | 0.0316 | 0.0253 | 0.1080 | 0.1410 |
-**mean (~0.032) and p99 (~0.12) are dead stable; only the absolute max moves** — which is why max is printed, not asserted. The single worst token (seq 7 / pos 5 / token 68172) is the *same* across bs=1 / eager-9 / graph-9: a deep-tail token at logprob ≈−10, far below the argmax. HF is fixed at −10.2508; pegainfer reads −9.8759 at bs=1 and −9.8134 in the 9-seq batch — the delta swings 0.3749→0.4374 purely from batch-dependent reduction order, with zero effect on the argmax. eager-9 and graph-9 are bit-identical, so the CUDA-graph path matches eager exactly at the same composition; only batch composition moves the number. As coverage grew (108→816 positions over the redesign) the max climbed 0.26→0.44 while mean/p99 held — the absolute max is a coverage treadmill, not a drift signal.
+**mean (~0.032) and p99 (~0.12) are dead stable; only the absolute max moves** — which is why max is printed, not asserted. The single worst token (seq 7 / pos 5 / token 68172) is the *same* across bs=1 / eager-9 / graph-9: a deep-tail token at logprob ≈−10, far below the argmax. HF is fixed at −10.2508; openinfer reads −9.8759 at bs=1 and −9.8134 in the 9-seq batch — the delta swings 0.3749→0.4374 purely from batch-dependent reduction order, with zero effect on the argmax. eager-9 and graph-9 are bit-identical, so the CUDA-graph path matches eager exactly at the same composition; only batch composition moves the number. As coverage grew (108→816 positions over the redesign) the max climbed 0.26→0.44 while mean/p99 held — the absolute max is a coverage treadmill, not a drift signal.
Tolerances were calibrated from this floor, strictly: `MEAN_TOL` 0.06 ≈ 2× the measured mean; `P99_TOL` 0.20 ≈ 1.6× the measured p99. Not comfortable round numbers — a loose gate would silently miss real drift smaller than its headroom.
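A rough sketch of how the mean / p99 bounds could be evaluated over a set of absolute logprob deltas. The nearest-rank percentile and the names `DeltaStats`, `summarize`, and `passes` are assumptions for illustration; the actual gate in `tests/hf_golden_gate.rs` may compute p99 differently:

```rust
const MEAN_TOL: f64 = 0.06;
const P99_TOL: f64 = 0.20;

/// Summary of absolute logprob deltas against the HF golden, in nats.
struct DeltaStats {
    mean: f64,
    p99: f64,
    max: f64,
}

/// Nearest-rank statistics (assumed definition). Returns `None` for empty input.
fn summarize(deltas: &[f64]) -> Option<DeltaStats> {
    if deltas.is_empty() {
        return None;
    }
    let mut sorted = deltas.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // ceil(0.99 * n) in integer arithmetic, so rank is always in 1..=n.
    let rank = (99 * n + 99) / 100;
    Some(DeltaStats { mean, p99: sorted[rank - 1], max: sorted[n - 1] })
}

/// Only mean and p99 are asserted; max is reported but never gates.
fn passes(stats: &DeltaStats) -> bool {
    stats.mean <= MEAN_TOL && stats.p99 <= P99_TOL
}
```

With this shape, a single deep-tail outlier moves `max` but leaves `mean` and `p99` near the floor, while a uniform shift of every delta trips `MEAN_TOL`.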
@@ -58,8 +58,8 @@ After a change that legitimately alters numerical output, recompute the golden o
uv run --no-project python tools/accuracy/dump_qwen3_4b_hf_golden.py \
--model-path /data/models/Qwen3-4B --out test_data/qwen3-4b-hf-golden.safetensors
-PEGAINFER_TEST_MODEL_PATH=/data/models/Qwen3-4B \
- cargo test --release -p pegainfer-qwen3-4b --test hf_golden_gate -- --nocapture
+OPENINFER_TEST_MODEL_PATH=/data/models/Qwen3-4B \
+ cargo test --release -p openinfer-qwen3-4b --test hf_golden_gate -- --nocapture
```
## Diagnosing a red gate
@@ -67,8 +67,8 @@ PEGAINFER_TEST_MODEL_PATH=/data/models/Qwen3-4B \
The gate prints the full delta distribution and the worst position (`seq`, `pos`, `token`, both logprobs) before it fails. Read that first:
- **`mean` over `MEAN_TOL` (or `p99` over `P99_TOL`), max near the floor** → a *systematic* drift: something shifted every logit a little (a kernel change, a dtype/rounding change, a norm/RoPE regression). Real bug — bisect the change.
-- **`mean`/`p99` at the floor, one lone `max` outlier** → a localised token error, or just a new bf16 tail outlier on different hardware. Adjudicate with fp32: regenerate the golden with `--dtype float32` and compare. If pegainfer tracks fp32 truth as well as HF-bf16 does, it is bf16 noise — the gate does not assert max precisely so this should not have failed; if you must widen `MEAN_TOL`/`P99_TOL`, record the measurement and multiple here.
-- **regret / argmax violation** → HF had a clear winner (regret > 0.20 nat) and pegainfer disagreed, or pegainfer's pick is absent from HF's top-64 entirely. Almost always a real wrong-token bug; 0.20 nat is far above a tie.
+- **`mean`/`p99` at the floor, one lone `max` outlier** → a localised token error, or just a new bf16 tail outlier on different hardware. Adjudicate with fp32: regenerate the golden with `--dtype float32` and compare. If openinfer tracks fp32 truth as well as HF-bf16 does, it is bf16 noise — the gate does not assert max precisely so this should not have failed; if you must widen `MEAN_TOL`/`P99_TOL`, record the measurement and multiple here.
+- **regret / argmax violation** → HF had a clear winner (regret > 0.20 nat) and openinfer disagreed, or openinfer's pick is absent from HF's top-64 entirely. Almost always a real wrong-token bug; 0.20 nat is far above a tie.
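The regret case can be sketched as follows, assuming regret is measured in HF's own logprobs as (HF's best logprob) minus (HF's logprob for the token openinfer picked). `Regret` and `regret_check` are hypothetical names; the gate's real types are not shown in this doc:

```rust
const MARGIN_TOL: f32 = 0.20;

#[derive(Debug, PartialEq)]
enum Regret {
    /// Our pick is HF's winner or within tolerance of it.
    Ok(f32),
    /// HF had a clear winner and we picked something else.
    TooLarge(f32),
    /// Our pick does not appear in HF's stored top-K at all.
    MissingFromTopK,
}

/// `hf_topk`: (token id, HF logprob) pairs sorted by descending logprob.
/// `our_pick`: the argmax token chosen by the engine under test.
fn regret_check(hf_topk: &[(u32, f32)], our_pick: u32) -> Regret {
    let Some(&(_, best)) = hf_topk.first() else {
        return Regret::MissingFromTopK;
    };
    match hf_topk.iter().find(|&&(tok, _)| tok == our_pick) {
        None => Regret::MissingFromTopK,
        Some(&(_, lp)) => {
            let regret = best - lp;
            if regret > MARGIN_TOL {
                Regret::TooLarge(regret)
            } else {
                Regret::Ok(regret)
            }
        }
    }
}
```

A near-tie in HF's distribution yields a small regret and passes even when the argmax differs, which is why this check is structural rather than an exact-token comparison.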
## Next step
diff --git a/docs/models/qwen3/kernels-crate.md b/docs/models/qwen3/kernels-crate.md
index f6af61fb..734bb09f 100644
--- a/docs/models/qwen3/kernels-crate.md
+++ b/docs/models/qwen3/kernels-crate.md
@@ -2,78 +2,78 @@
**Created**: 2026-05-03
**Status**: complete
-**TL;DR**: Phase 1 now extracts the Qwen3-4B dense full-attention kernel surface into `crates/pegainfer-kernels`, with a compact kernel index so future LLM sessions can jump from model DAG nodes to Rust wrappers, FFI symbols, CUDA/Triton sources, and shape constraints. `KvPool`, `PagePool`, and `SamplingParams` stay in the root runtime. Local metadata/format checks pass; GPU release build, release test-target compilation, release clippy, Qwen3-4B e2e, and `bench_serving snapshot` pass.
+**TL;DR**: Phase 1 now extracts the Qwen3-4B dense full-attention kernel surface into `crates/openinfer-kernels`, with a compact kernel index so future LLM sessions can jump from model DAG nodes to Rust wrappers, FFI symbols, CUDA/Triton sources, and shape constraints. `KvPool`, `PagePool`, and `SamplingParams` stay in the root runtime. Local metadata/format checks pass; GPU release build, release test-target compilation, release clippy, Qwen3-4B e2e, and `bench_serving snapshot` pass.
## Preparation
- **Read**:
- `docs/index.md` - confirmed the relevant architecture, kernel, TP, benchmarking, and Qwen3 history docs.
- - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - recorded the per-model engine direction, but its near-term ordering needs to be corrected from ledger-first to crate-first.
+ - `docs/subsystems/kernels/openinfer-kernels-boundary.md` - recorded the per-model engine direction, but its near-term ordering needs to be corrected from ledger-first to crate-first.
- `docs/models/qwen3/tp-design.md` - confirmed Qwen3-4B TP constraints and runtime hazards around per-thread CUDA/cuBLAS state.
- `src/model/qwen3/*`, `src/ops/*`, `src/ffi.rs`, `src/tensor.rs`, `src/kv_pool.rs`, `src/page_pool.rs`, and `build.rs` - mapped the current Qwen3-4B kernel calls, tensor/runtime dependencies, paged KV metadata, and CUDA/Triton build pipeline.
- **Relevant history**:
- `docs/models/qwen3/tp-design.md` shows that Qwen3 execution is already rank-local and step-oriented, so the kernel crate must not hide device binding or TP collective points.
- **Plan**:
- 1. Convert the repository into a Cargo workspace while keeping the root `pegainfer` package as the server/control-plane crate.
- 2. Create `crates/pegainfer-kernels` with the Qwen3-4B kernel surface: kernel ABI tensor helpers, Qwen3-used `ops`, FFI declarations, CUDA/Triton build support, and Qwen3 paged-attention layout metadata helpers.
- 3. Move Qwen3 call sites to import `pegainfer_kernels::{ops, tensor}` and remove direct Qwen3 dependence on root-local `ops`, `ffi`, and `tensor` modules.
+ 1. Convert the repository into a Cargo workspace while keeping the root `openinfer` package as the server/control-plane crate.
+ 2. Create `crates/openinfer-kernels` with the Qwen3-4B kernel surface: kernel ABI tensor helpers, Qwen3-used `ops`, FFI declarations, CUDA/Triton build support, and Qwen3 paged-attention layout metadata helpers.
+ 3. Move Qwen3 call sites to import `openinfer_kernels::{ops, tensor}` and remove direct Qwen3 dependence on root-local `ops`, `ffi`, and `tensor` modules.
4. Preserve repository build health. If Qwen3.5 still requires symbols from the old combined CUDA library, either keep those symbols as compatibility exports in the kernels crate or explicitly document and gate any temporary Qwen3-only limitation before making code changes.
5. Add a kernel index for LLM navigation under the new crate:
- `KERNELS.md`: short human/LLM routing table from `qwen3_4b::::` to Rust wrapper, FFI symbol, source file, backend, shape/layout constraints, and status.
- Machine-readable model DAG metadata should wait for the Qwen3-4B model crate, where it can be generated or validated from model code instead of hand-maintained in the generic kernels crate.
- 6. Update `docs/subsystems/kernels/pegainfer-kernels-boundary.md` and `docs/index.md` so the recorded next step is crate-first, with ledger/trace/simulator as metadata products of the crate boundary.
+ 6. Update `docs/subsystems/kernels/openinfer-kernels-boundary.md` and `docs/index.md` so the recorded next step is crate-first, with ledger/trace/simulator as metadata products of the crate boundary.
7. Verify with `cargo test --release` or, if the local environment blocks full release tests, at least `cargo check --release` and report the exact blocker.
- **Risks / open questions**:
- - A strict Qwen3-only CUDA extraction can conflict with the current default binary because Qwen3.5 still compiles in the same root crate and references some shared FFI symbols. The safest implementation may need to move the link/build owner to `pegainfer-kernels` while only stabilizing and indexing the Qwen3 API first.
+ - A strict Qwen3-only CUDA extraction can conflict with the current default binary because Qwen3.5 still compiles in the same root crate and references some shared FFI symbols. The safest implementation may need to move the link/build owner to `openinfer-kernels` while only stabilizing and indexing the Qwen3 API first.
- `kv_pool` and `page_pool` sit between model state and kernel metadata. For Phase 1, only the kernel-facing layout/descriptor pieces should move if needed; scheduler-owned allocation policy should remain in the root crate unless compilation forces a narrower split.
- - Build-script path handling is fragile when moving kernel source into `crates/pegainfer-kernels/`. The plan should prefer one build owner and avoid compiling the same C symbols in both root and dependency crates.
+ - Build-script path handling is fragile when moving kernel source into `crates/openinfer-kernels/`. The plan should prefer one build owner and avoid compiling the same C symbols in both root and dependency crates.
## Execution Log
### Step 1: Create kernels crate and move build ownership
-- Converted the repository into a Cargo workspace with `crates/pegainfer-kernels`.
-- Added `pegainfer-kernels` as a root dependency.
-- Moved CUDA source from root `csrc/` to `crates/pegainfer-kernels/csrc/`.
-- Moved Triton AOT files from root `tools/triton/` to `crates/pegainfer-kernels/tools/triton/`.
-- Moved the FlashInfer submodule path from `third_party/flashinfer` to `crates/pegainfer-kernels/third_party/flashinfer`.
-- Replaced the root `build.rs` with an intentionally empty build script; `crates/pegainfer-kernels/build.rs` now owns CUDA/Triton compilation.
+- Converted the repository into a Cargo workspace with `crates/openinfer-kernels`.
+- Added `openinfer-kernels` as a root dependency.
+- Moved CUDA source from root `csrc/` to `crates/openinfer-kernels/csrc/`.
+- Moved Triton AOT files from root `tools/triton/` to `crates/openinfer-kernels/tools/triton/`.
+- Moved the FlashInfer submodule path from `third_party/flashinfer` to `crates/openinfer-kernels/third_party/flashinfer`.
+- Replaced the root `build.rs` with an intentionally empty build script; `crates/openinfer-kernels/build.rs` now owns CUDA/Triton compilation.
-- Moved kernel-owned ABI and operator code into `crates/pegainfer-kernels/src/`: `ffi`, tensor helpers, paged-KV geometry metadata, and the Qwen3-used `ops` modules.
+- Moved kernel-owned ABI and operator code into `crates/openinfer-kernels/src/`: `ffi`, tensor helpers, paged-KV geometry metadata, and the Qwen3-used `ops` modules.
- Kept `KvPool`, `PagePool`, and `SamplingParams` in the root crate because they are runtime allocation/policy state, not kernels.
- Replaced root `src/ffi.rs` and `src/tensor.rs` with compatibility re-exports.
-- Replaced root `src/ops.rs` with re-exports from `pegainfer-kernels` plus thin root adapters for sampling, paged prefill planning, paged attention layout conversion, and the remaining Qwen3.5 recurrent wrapper.
+- Replaced root `src/ops.rs` with re-exports from `openinfer-kernels` plus thin root adapters for sampling, paged prefill planning, paged attention layout conversion, and the remaining Qwen3.5 recurrent wrapper.
- Removed duplicate root `src/ops/{attention,elementwise,embedding,linear,norm,sampling}.rs`.
- Kept `src/ops/recurrent.rs` in root for now because it depends on Qwen3.5's model-local `GdrChunkwiseScratch35`; moving that would expand Phase 1 beyond Qwen3-4B.
### Step 3: Add kernel index for LLM navigation
-- Added `crates/pegainfer-kernels/KERNELS.md`.
+- Added `crates/openinfer-kernels/KERNELS.md`.
- The index maps each Qwen3-4B op ID to phase, Rust wrapper, FFI symbol, source file, backend, and shape/layout notes.
- Removed the initial `kernel_manifest/qwen3_4b.toml` idea from the kernels crate. A hand-maintained machine-readable manifest in the generic kernel crate would drift; the right place is the future Qwen3-4B model crate, where the manifest can describe the model DAG and be generated or checked against code.
### Step 4: Documentation updates
-- Updated `CLAUDE.md`, `README.md`, and `docs/playbooks/developer-onboarding.md` to point CUDA/Triton paths at `crates/pegainfer-kernels/`.
-- Updated `docs/subsystems/kernels/pegainfer-kernels-boundary.md` to record crate-first ordering before ledger/simulator work.
+- Updated `CLAUDE.md`, `README.md`, and `docs/playbooks/developer-onboarding.md` to point CUDA/Triton paths at `crates/openinfer-kernels/`.
+- Updated `docs/subsystems/kernels/openinfer-kernels-boundary.md` to record crate-first ordering before ledger/simulator work.
### Step 5: Verification
-- `cargo metadata --no-deps --format-version 1` succeeded and showed both workspace packages: root `pegainfer` and `pegainfer-kernels`.
+- `cargo metadata --no-deps --format-version 1` succeeded and showed both workspace packages: root `openinfer` and `openinfer-kernels`.
- `cargo fmt --all` applied formatting, then `cargo fmt --all --check` passed.
-- `PEGAINFER_CUDA_SM=120 cargo check --release` reached the `pegainfer-kernels` build script and failed at `nvcc` execution because this machine has no `nvcc`.
+- `OPENINFER_CUDA_SM=120 cargo check --release` reached the `openinfer-kernels` build script and failed at `nvcc` execution because this machine has no `nvcc`.
### Step 6: GPU release compile
- Avoided overwriting `` because that validation checkout has unrelated uncommitted work.
- Synced the local working tree to `` with `rsync`, excluding `.git/`, `target/`, `.venv/`, and `models/`.
-- Copied the existing validation FlashInfer submodule contents from `/third_party/flashinfer` into `crates/pegainfer-kernels/third_party/flashinfer` inside the build directory.
-- `PEGAINFER_CUDA_SM=120 cargo build --release` passed on the CUDA validation host. First pass exposed two Rust warnings from this split (`SamplingParams::is_greedy` unused and root `PrefillPagedPlan` visibility too wide); both were cleaned up.
-- Re-synced and reran `PEGAINFER_CUDA_SM=120 cargo build --release`; it passed in 14.16s with only build-script informational warnings.
-- `PEGAINFER_CUDA_SM=120 cargo test --release --no-run` passed in 12.28s and compiled all unit, binary, e2e, paged-attention, and regen test targets.
+- Copied the existing validation FlashInfer submodule contents from `/third_party/flashinfer` into `crates/openinfer-kernels/third_party/flashinfer` inside the build directory.
+- `OPENINFER_CUDA_SM=120 cargo build --release` passed on the CUDA validation host. First pass exposed two Rust warnings from this split (`SamplingParams::is_greedy` unused and root `PrefillPagedPlan` visibility too wide); both were cleaned up.
+- Re-synced and reran `OPENINFER_CUDA_SM=120 cargo build --release`; it passed in 14.16s with only build-script informational warnings.
+- `OPENINFER_CUDA_SM=120 cargo test --release --no-run` passed in 12.28s and compiled all unit, binary, e2e, paged-attention, and regen test targets.
### Step 7: GPU e2e and serving benchmark
- Ran Qwen3-4B e2e on the same validation build directory:
- - `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release --test e2e -- --nocapture`
+ - `OPENINFER_CUDA_SM=120 OPENINFER_TEST_MODEL_PATH= cargo test --release --test e2e -- --nocapture`
- Result: pass, 1 test passed in 9.36s.
- Covered greedy golden outputs, multi-request generation, and consumer-drop scheduler survival.
- Ran the standard in-process serving snapshot:
- - `RUST_LOG=warn PEGAINFER_CUDA_SM=120 cargo run --release --bin bench_serving -- --model-path snapshot`
+ - `RUST_LOG=warn OPENINFER_CUDA_SM=120 cargo run --release --bin bench_serving -- --model-path snapshot`
- Result: pass.
- RTX 5090 Qwen3-4B snapshot:
- `prefill_heavy (10000,1)`: TTFT p50 `501.93ms`, p99 `503.75ms`.
@@ -87,17 +87,17 @@
- Ran local `cargo fmt --all --check`: pass.
- Ran local `cargo metadata --no-deps --format-version 1`: pass.
- Synced the current working tree to ``.
-- Ran `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` on the CUDA validation host: pass in 1m42s.
+- Ran `OPENINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` on the CUDA validation host: pass in 1m42s.
### Unexpected
-- Local `cargo check --release` reached `pegainfer-kernels` build script but failed because this machine does not have `nvcc`; the user will provide a GPU build machine for compilation.
-- A second `cargo check --release -p pegainfer-kernels --lib` without `PEGAINFER_CUDA_SM` failed earlier at GPU SM detection, which is expected on this local machine without `nvidia-smi`.
+- Local `cargo check --release` reached `openinfer-kernels` build script but failed because this machine does not have `nvcc`; the user will provide a GPU build machine for compilation.
+- A second `cargo check --release -p openinfer-kernels --lib` without `OPENINFER_CUDA_SM` failed earlier at GPU SM detection, which is expected on this local machine without `nvidia-smi`.
- The validation checkout was dirty, so verification used a separate validation build directory instead of modifying that checkout.
- The validation build directory does not include `.git/`, so `bench_serving snapshot` reports `commit: unknown`.
## Debrief
-- **Outcome**: Implemented and validated the crate-first Phase 1 split. Kernel source, Triton source, FlashInfer submodule ownership, CUDA/Triton build script, FFI, kernel ABI tensor helpers, paged-KV layout metadata, and Qwen3-used Rust ops now live under `crates/pegainfer-kernels`. Root `pegainfer` keeps server/model code, `KvPool`, `PagePool`, `SamplingParams`, and thin compatibility adapters. The split passes local format/metadata checks, GPU release build/test-target compilation, release clippy, Qwen3-4B e2e, and the standard Qwen3-4B `bench_serving snapshot`.
+- **Outcome**: Implemented and validated the crate-first Phase 1 split. Kernel source, Triton source, FlashInfer submodule ownership, CUDA/Triton build script, FFI, kernel ABI tensor helpers, paged-KV layout metadata, and Qwen3-used Rust ops now live under `crates/openinfer-kernels`. Root `openinfer` keeps server/model code, `KvPool`, `PagePool`, `SamplingParams`, and thin compatibility adapters. The split passes local format/metadata checks, GPU release build/test-target compilation, release clippy, Qwen3-4B e2e, and the standard Qwen3-4B `bench_serving snapshot`.
- **Pitfalls encountered**:
- Root `src/ops/recurrent.rs` cannot be moved cleanly in this pass because it takes Qwen3.5's `GdrChunkwiseScratch35` type. Moving it would pull hybrid-model scratch ownership into the kernels crate, which is outside the Qwen3-4B Phase 1 scope.
- Initially moved `KvPool`, `PagePool`, and `SamplingParams` into the kernels crate. That was too broad; those belong to runtime policy and have been moved back to root.
@@ -106,6 +106,6 @@
- The kernel crate should own source and build artifacts physically, not only re-export copied Rust wrappers. Keeping `csrc/`, `tools/triton/`, and `third_party/flashinfer` in root creates exactly the duplicate context we are trying to remove.
- The human/LLM routing index belongs beside the kernels crate because it helps edit reusable kernels. Machine-readable model DAG manifests should not live there unless they are generated or validated; they belong with the model crate that owns the DAG.
- **Follow-ups**:
- - Phase 2 can extract the Qwen3 model crate on top of `pegainfer-kernels`.
+ - Phase 2 can extract the Qwen3 model crate on top of `openinfer-kernels`.
- In the Qwen3 model crate, define the model-owned kernel DAG and decide whether any TOML/JSON manifest is generated from Rust code, validated against wrappers, or avoided entirely in favor of trace IDs emitted directly from the executor.
- Run Qwen3.5 e2e separately on a box with `` if later changes touch the compatibility kernels or recurrent wrappers.
diff --git a/docs/models/qwen3/kv-pressure-hang.md b/docs/models/qwen3/kv-pressure-hang.md
index 3c3c1a76..40b5bc89 100644
--- a/docs/models/qwen3/kv-pressure-hang.md
+++ b/docs/models/qwen3/kv-pressure-hang.md
@@ -15,11 +15,11 @@
- `.codex/harness/README.md` - confirms the verification ladder and safety boundaries.
- `.codex/harness/commands.md` - provides Qwen3 e2e, server, and benchmark commands.
- `.codex/harness/verification.md` - classifies this as serving/scheduler behavior needing a narrow repro plus HTTP/benchmark evidence.
- - `pegainfer-qwen3-4b/src/scheduler.rs` - admission control currently defers requests under KV pressure.
- - `pegainfer-qwen3-4b/src/scheduler/plan.rs` - execution plans currently consume pending requests before failures are handled.
- - `pegainfer-qwen3-4b/src/scheduler/effects.rs` - successful finishes drop request state; scheduler execution errors do not.
- - `pegainfer-qwen3-4b/src/executor.rs` - `drop_request` is the existing owner API for releasing per-request KV state.
- - `pegainfer-core/src/kv_pool.rs` and `pegainfer-core/src/page_pool.rs` - KV pages are RAII-returned only when request state is dropped.
+ - `openinfer-qwen3-4b/src/scheduler.rs` - admission control currently defers requests under KV pressure.
+ - `openinfer-qwen3-4b/src/scheduler/plan.rs` - execution plans currently consume pending requests before failures are handled.
+ - `openinfer-qwen3-4b/src/scheduler/effects.rs` - successful finishes drop request state; scheduler execution errors do not.
+ - `openinfer-qwen3-4b/src/executor.rs` - `drop_request` is the existing owner API for releasing per-request KV state.
+ - `openinfer-core/src/kv_pool.rs` and `openinfer-core/src/page_pool.rs` - KV pages are RAII-returned only when request state is dropped.
- GitHub issue #85 - observed server stays alive but completions hang after QPS=2 KV pressure.
- **Relevant history**:
- `docs/subsystems/scheduler/scheduler.md` - QPS=2 varied workload is near capacity and already had some failed requests; the fix must handle pressure explicitly rather than claim higher throughput.
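The RAII behaviour noted in the read list is the crux of the hang: pages only come back when the per-request state is dropped. A simplified, single-threaded stand-in (these `PagePool` and `RequestKv` types are illustrative and are not the `openinfer-core` implementations):

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Stand-in page pool: a shared free list of page indices.
#[derive(Clone)]
struct PagePool {
    free: Rc<RefCell<Vec<usize>>>,
}

/// Per-request KV state; owns its pages until dropped.
struct RequestKv {
    pages: Vec<usize>,
    pool: PagePool,
}

impl PagePool {
    fn new(pages: usize) -> Self {
        Self { free: Rc::new(RefCell::new((0..pages).collect())) }
    }

    fn available(&self) -> usize {
        self.free.borrow().len()
    }

    /// Take `n` pages, or return `None` so the caller can keep the request waiting.
    fn alloc(&self, n: usize) -> Option<RequestKv> {
        let mut free = self.free.borrow_mut();
        if free.len() < n {
            return None;
        }
        let at = free.len() - n;
        let pages = free.split_off(at);
        Some(RequestKv { pages, pool: self.clone() })
    }
}

impl Drop for RequestKv {
    /// Pages return to the pool only here, when the request state is dropped.
    fn drop(&mut self) {
        self.pool.free.borrow_mut().append(&mut self.pages);
    }
}
```

In this model, an error path that retires a request without dropping its `RequestKv` keeps the pages out of the free list, and every later `alloc` under pressure returns `None`.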
@@ -45,7 +45,7 @@
- decode errors surfacing as `TokenEvent::Error`, dropping request state, and allowing recovery;
- client/receiver drop releasing request state.
- Changed `DecodeEffect::EmitAndContinue` send-failure handling to call `drop_request` before retiring the active request.
-- Result: remote RTX 5090 `cargo test --release -p pegainfer-qwen3-4b --lib scheduler -- --nocapture` passed, `4 passed`.
+- Result: remote RTX 5090 `cargo test --release -p openinfer-qwen3-4b --lib scheduler -- --nocapture` passed, `4 passed`.
### Step 2: Maintainer feedback refinement
- The maintainer clarified that the basic fix should keep requests that cannot get KV allocation in the waiting queue; preemption can be deferred.
@@ -59,21 +59,21 @@
### Step 3: Build and static gates
- Remote environment:
- GPU: NVIDIA GeForce RTX 5090, driver `580.76.05`, 32607 MiB.
- - CUDA: `nvcc` `13.0.88`, `PEGAINFER_CUDA_SM=120`.
+ - CUDA: `nvcc` `13.0.88`, `OPENINFER_CUDA_SM=120`.
- Rust: `rustc 1.97.0-nightly (7c3c88f42 2026-05-14)`.
- Model: `models/Qwen3-4B`, HF revision metadata `1cfa9a7208912126459214e8b04321603b3df60c`.
- Commands:
- `cargo fmt --check` — passed.
- - `cargo test --release -p pegainfer-qwen3-4b --lib scheduler -- --nocapture` — passed, `4 passed`.
- - `cargo clippy --release -p pegainfer-qwen3-4b --lib -- -D warnings` — passed.
- - `cargo build --release -p pegainfer-server` — passed.
+ - `cargo test --release -p openinfer-qwen3-4b --lib scheduler -- --nocapture` — passed, `4 passed`.
+ - `cargo clippy --release -p openinfer-qwen3-4b --lib -- -D warnings` — passed.
+ - `cargo build --release -p openinfer-server` — passed.
- Local command:
- `~/.cargo/bin/cargo fmt --check` — passed.
### Step 4: E2E and serving pressure validation
- Installed `vllm 0.21.0` in the validation venv to run the issue's real `vllm bench serve` client.
- Ran a host-local exact e2e check against the validation model snapshot:
- - `PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture`
+ - `OPENINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p openinfer-qwen3-4b --test e2e -- --nocapture`
- Result after local fixture regeneration for that model snapshot: passed, `1 passed`.
- PR review later found the regenerated fixture was not portable to the standard local model snapshot, so the repository `test_data/Qwen3-4B.json` change was reverted and this e2e result is not used as a merge gate.
- Ran a small issue-shaped benchmark first:
@@ -90,19 +90,19 @@
### Step 5: Compatibility fix encountered during validation
- Remote CUDA 13.0 initially failed with the existing `cudarc` `cuda-13010` feature because the driver/runtime lacked `cuDevSmResourceSplit`.
- Kept the workspace on `cuda-13010`; changing the shared `cudarc` feature would widen the PR's collaboration surface beyond issue #85.
-- Fixed `qwen3_decode_context` test-target compilation by linking `cudaProfilerStart/Stop` directly from `cudart`; the symbols were not exposed through `pegainfer_core::ffi`.
+- Fixed `qwen3_decode_context` test-target compilation by linking `cudaProfilerStart/Stop` directly from `cudart`; the symbols were not exposed through `openinfer_core::ffi`.
### Step 6: Final diff hygiene
- `git diff --check` — passed.
-- Confirmed the remote pegainfer server process was stopped after validation.
+- Confirmed the remote openinfer server process was stopped after validation.
### Step 7: Maintainer-style review follow-up
- Re-reviewed the changed scheduler and bridge paths after the main fix.
- Found one API-contract issue: `TokenEvent::Rejected` was being translated to vLLM `EngineCoreFinishReason::Stop`, which would make an impossible KV request look like an empty successful response.
-- Changed `pegainfer-server/src/vllm_frontend.rs` so `Rejected` maps to `EngineCoreFinishReason::Error` with the rejection message as `stop_reason`.
+- Changed `openinfer-server/src/vllm_frontend.rs` so `Rejected` maps to `EngineCoreFinishReason::Error` with the rejection message as `stop_reason`.
- Added `vllm_frontend::tests::rejected_request_is_reported_as_error`.
- Remote RTX 5090 command:
- - `cargo test --release -p pegainfer-server rejected_request_is_reported_as_error --lib` — passed, `1 passed`.
+ - `cargo test --release -p openinfer-server rejected_request_is_reported_as_error --lib` — passed, `1 passed`.
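The contract fix in this step amounts to one match arm. A sketch with stand-in enums (the real `TokenEvent` and `EngineCoreFinishReason` definitions are not reproduced here, so the variants and payloads below are assumptions):

```rust
/// Stand-in for the engine-side event type.
enum TokenEvent {
    Token(u32),
    Finished,
    Rejected(String),
    Error(String),
}

/// Stand-in for the frontend finish reason.
#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
    Error,
}

/// Maps a terminal event to (finish reason, optional stop_reason message).
/// Returns `None` for non-terminal events.
fn finish_for(event: &TokenEvent) -> Option<(FinishReason, Option<String>)> {
    match event {
        TokenEvent::Token(_) => None,
        TokenEvent::Finished => Some((FinishReason::Stop, None)),
        // A rejection must not look like an empty successful completion.
        TokenEvent::Rejected(msg) | TokenEvent::Error(msg) => {
            Some((FinishReason::Error, Some(msg.clone())))
        }
    }
}
```

Carrying the rejection message through as the stop reason lets a client distinguish an impossible KV request from a model that simply stopped early.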
### Step 8: PR review comment follow-up
- Read PR #131 review comments from `gemini-code-assist`. The comments claimed the KV budget formulas should use `prompt_len + max_tokens` and `prompt_len + generated_count`.
diff --git a/docs/models/qwen3/model-crate.md b/docs/models/qwen3/model-crate.md
index 39636723..35403417 100644
--- a/docs/models/qwen3/model-crate.md
+++ b/docs/models/qwen3/model-crate.md
@@ -2,23 +2,23 @@
**Created**: 2026-05-03
**Status**: ready for diff review
-**TL;DR**: `crates/pegainfer-qwen3-4b` now owns Qwen3 config, weights, execution, scheduler, tests, benches, and kernel plan. Root `pegainfer` loads Qwen3 through a generic `EngineHandle` and no longer contains `Qwen3Model`, `Qwen3Executor`, `ModelRuntimeConfig`, root Qwen3 tests, or `src/model/qwen3/*`. The old `ModelForward` path has been removed; decode length-limit now emits the final token before `Finished`. Long-context `bs=1` TPOT was traced to non-partition FlashInfer paged decode under-filling the GPU; Qwen3 runtime gates FlashInfer split-K decode for `padded_bs<=2 && seq_len>=1024` and was retuned to `chunk_tokens=256,max_chunks=64`, cutting 4k/64 serving steady TPOT from about `11.7ms` to `6.46ms` on RTX 5090. Qwen3 now keeps a single model-crate bench entry: `qwen3_kernel_snapshot`, a JSON snapshot runner with warm/cold-L2 latency, default-on CUPTI counters, and compare. Correctness/truth is intentionally out of this snapshot for now.
+**TL;DR**: `crates/openinfer-qwen3-4b` now owns Qwen3 config, weights, execution, scheduler, tests, benches, and kernel plan. Root `openinfer` loads Qwen3 through a generic `EngineHandle` and no longer contains `Qwen3Model`, `Qwen3Executor`, `ModelRuntimeConfig`, root Qwen3 tests, or `src/model/qwen3/*`. The old `ModelForward` path has been removed; decode length-limit now emits the final token before `Finished`. Long-context `bs=1` TPOT was traced to non-partition FlashInfer paged decode under-filling the GPU; Qwen3 runtime gates FlashInfer split-K decode for `padded_bs<=2 && seq_len>=1024` and was retuned to `chunk_tokens=256,max_chunks=64`, cutting 4k/64 serving steady TPOT from about `11.7ms` to `6.46ms` on RTX 5090. Qwen3 now keeps a single model-crate bench entry: `qwen3_kernel_snapshot`, a JSON snapshot runner with warm/cold-L2 latency, default-on CUPTI counters, and compare. Correctness/truth is intentionally out of this snapshot for now.
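The split-K gate quoted in the TL;DR can be read as a small predicate. The sketch below uses the thresholds from the text, but the way `chunk_tokens` and `max_chunks` combine into a chunk count is an assumption, and `SplitKConfig` / `split_k_chunks` are hypothetical names:

```rust
/// Hypothetical config mirroring the values quoted in the TL;DR.
struct SplitKConfig {
    max_padded_bs: usize,
    min_seq_len: usize,
    chunk_tokens: usize,
    max_chunks: usize,
}

const QWEN3_SPLIT_K: SplitKConfig = SplitKConfig {
    max_padded_bs: 2,
    min_seq_len: 1024,
    chunk_tokens: 256,
    max_chunks: 64,
};

/// Returns the number of KV chunks to split decode attention across,
/// or `None` when the non-partitioned path should be used.
fn split_k_chunks(cfg: &SplitKConfig, padded_bs: usize, seq_len: usize) -> Option<usize> {
    if padded_bs > cfg.max_padded_bs || seq_len < cfg.min_seq_len {
        return None;
    }
    // Assumed policy: one chunk per `chunk_tokens` tokens, capped at `max_chunks`.
    Some(seq_len.div_ceil(cfg.chunk_tokens).min(cfg.max_chunks))
}
```

Under this reading, a `bs=1` request at 4k context splits into 16 chunks, which is the kind of extra parallelism that would address the under-filled GPU described above.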
## Preparation
- **Read**:
- `docs/index.md` - identified the kernels/core crate split and per-model boundary docs.
- - `docs/models/qwen3/kernels-crate.md` - Qwen3 kernel source/build ownership and human kernel index already live in `pegainfer-kernels`; model-owned DAG metadata should live with the model crate.
- - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - records the per-model engine direction and says root should be reusable frontend/control-plane infrastructure, not a universal model abstraction.
+ - `docs/models/qwen3/kernels-crate.md` - Qwen3 kernel source/build ownership and human kernel index already live in `openinfer-kernels`; model-owned DAG metadata should live with the model crate.
+ - `docs/subsystems/kernels/openinfer-kernels-boundary.md` - records the per-model engine direction and says root should be reusable frontend/control-plane infrastructure, not a universal model abstraction.
- `src/main.rs`, `src/lib.rs`, `src/server_engine.rs`, `src/scheduler.rs`, `src/model_executor.rs`, `src/model/qwen3/*`, `src/bin/bench_serving.rs`, and Qwen3 tests - mapped what root currently knows about Qwen3.
- **Relevant history**:
- The earlier shared-runtime work (now consolidated into `docs/subsystems/runtime/runtime.md`) was a useful simplification, but the next boundary should not make `ModelForward` the long-term universal engine API.
- **Plan**:
1. Define the model crate/root interface before moving code.
- 2. Move the generic text-generation handle/request/event types into `pegainfer-core` so root and model crates can communicate without model crates depending on root.
- 3. Create `crates/pegainfer-qwen3-4b` and move Qwen3 config, weights, forward paths, decode buffers, `Qwen3Executor`, Qwen3 scheduler internals, Qwen3 correctness tests, and Qwen3-specific benches into it.
- 4. Keep root `pegainfer` as frontend plus model registry. The registry can know crate names, but `main`, `vllm_frontend`, and generic benchmark code should only see `EngineHandle`, `ModelInfo`, and tokenizer path.
- 5. Add a model-owned `kernel_plan.rs` in the Qwen3 crate as the LLM/human index from model DAG phases to reusable kernels. Do not add a hand-maintained public TOML in `pegainfer-kernels`.
+ 2. Move the generic text-generation handle/request/event types into `openinfer-core` so root and model crates can communicate without model crates depending on root.
+ 3. Create `crates/openinfer-qwen3-4b` and move Qwen3 config, weights, forward paths, decode buffers, `Qwen3Executor`, Qwen3 scheduler internals, Qwen3 correctness tests, and Qwen3-specific benches into it.
+ 4. Keep root `openinfer` as frontend plus model registry. The registry can know crate names, but `main`, `vllm_frontend`, and generic benchmark code should only see `EngineHandle`, `ModelInfo`, and tokenizer path.
+ 5. Add a model-owned `kernel_plan.rs` in the Qwen3 crate as the LLM/human index from model DAG phases to reusable kernels. Do not add a hand-maintained public TOML in `openinfer-kernels`.
6. Verify locally with format/metadata, then on the CUDA validation host with release build, clippy, Qwen3 crate e2e, and root `bench_serving snapshot`. Keep microbench timing in Criterion benches instead of duplicating it as a test.
- **Risks / open questions**:
- If the scheduler stays in root, root still knows Qwen3's execution shape. To meet the stated goal, the Qwen3 scheduler should move into the Qwen3 crate and expose only a generic handle.
@@ -30,7 +30,7 @@
The root-visible interface should be request/response oriented, not prefill/decode oriented.
```rust
-// pegainfer-core
+// openinfer-core
pub struct EngineLoadOptions {
pub enable_cuda_graph: bool,
pub device_ordinals: Vec,
@@ -65,7 +65,7 @@ pub struct EngineHandle {
```
```rust
-// pegainfer-qwen3-4b
+// openinfer-qwen3-4b
pub fn probe_model(model_path: &std::path::Path) -> anyhow::Result