openinfer-project · CAICAIIs · Jul 2, 2026 · Jul 2, 2026 · Jul 3, 2026
diff --git a/docs/index.md b/docs/index.md
@@ -43,9 +43,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 
 | Path | TL;DR |
 | --- | --- |
-| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, prefix-cache design, serving concurrency. |
+| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap: DFlash is opt-in and batched for multi-active decode, with RTX 5090 in-process c4/c8/c16 gains across decode-heavy, medium, and long prompts. Open items: HTTP pressure validation for DFlash, HND prefill staging, prefix-cache design, and broader feature coverage. |
 | `models/qwen35/kv-admission.md` | Issue #254 complete: Qwen3.5 now uses full-lifetime KV admission, deferred pressure handling, impossible-request rejection, explicit error semantics, direct rejection-event coverage, RTX 5090 e2e, and real HTTP pressure/post-pressure validation. |
 | `models/qwen35/optimization.md` | Hybrid 24 linear + 8 full attn optimization ledger. Decode-tuning refresh fuses MLP gate/up and tunes decode cublasLt buckets, improving direct TPOT by 2-3%; vLLM still leads 1024/256 HTTP decode. |
+| `models/qwen35/dflash-speculative-decoding.md` | Qwen3.5 DFlash speculative decoding is opt-in and batched: hybrid-state transaction covers KV + recurrent + conv state, correctness gates pass, and RTX 5090 in-process c4/c8/c16 A/B improves `prompt_len=1` by `+16.7%/+15.4%/+14.0%`, `1024` by `+209.9%/+168.6%/+45.3%`, and `4096` by `+135.9%/+35.7%/+25.6%`. |
 | `models/qwen35/accuracy.md` | Qwen3.5-4B HF bf16 logits goldens through `past_key_values`: short replay covers sequential graph, bucket-straddling batched graph, and slot-compaction; long replay covers 4097/8192-token prompts; full GSM8K 8-shot now matches the HF baseline within 0.15 percentage points. |
 | `models/qwen35/model-crate.md` | `openinfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; feature-gated behind `qwen35-4b` (Triton AOT is the only Python build dependency); root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. |
 | `models/qwen35/kernel-plan.md` | Qwen3.5-4B has a `openinfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module — enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. |

diff --git a/docs/models/qwen35/dflash-speculative-decoding.md b/docs/models/qwen35/dflash-speculative-decoding.md
@@ -0,0 +1,90 @@
+# DFlash Speculative Decoding (Qwen3.5-4B)
+
+> **TL;DR:** Qwen3.5-4B DFlash speculative decoding is implemented behind `--dflash-draft-model-path`, default-off, greedy-only, single-GPU only, and now supports multi-active decode batches with a fixed-buffer batched verifier. Same-host RTX 5090 A/B on `output_len=256` shows clear throughput wins at c4/c8/c16: decode-heavy `prompt_len=1` improves `+16.7%/+15.4%/+14.0%`, medium `prompt_len=1024` improves `+209.9%/+168.6%/+45.3%`, and long `prompt_len=4096` improves `+135.9%/+35.7%/+25.6%`.
+
+Last touched: 2026-07
+
+## How To Enable
+
+Use a Qwen3.5 target model with a matching DFlash draft checkpoint:
+
+```bash
+cargo run --release -p openinfer-server --features qwen35-4b -- \
+  --model-path <Qwen3.5-4B> \
+  --dflash-draft-model-path <Qwen3.5-4B-DFlash>
+```
+
+The flag is rejected for unsupported model lines. Qwen3.5 DFlash is incompatible with LoRA, KV offload, tensor parallelism, and decode-overlap modes. Non-greedy requests and requests that ask for logprobs fall back to normal decode.
+
+## Runtime Contract
+
+- The drafter emits `[current_token, draft...]`; the target verifies that span and commits the longest greedy-matching prefix plus one bonus token.
+- Verification uses preallocated `VerifyBuffers35` storage for token ids, hidden/logit buffers, GDR scratch, full-attention scratch, paged prefill plans, and sampling scratch. Decode steps reuse fixed buffers instead of allocating on the hot path.
+- Qwen3.5 verification is a hybrid transaction over full-attention KV, recurrent state, convolution state, and sequence length. Verify writes to scratch state; commit preserves full-span accepts directly and replays only truncated accepted spans after rolling KV back to the canonical boundary.
+- Batched verify handles active batches up to the scheduler bucket size. Complete fixed shapes can use captured graph-compatible paths; truncated or heterogeneous spans use eager verify.
+- The scheduler captures target hidden context only on DFlash-eligible prefill paths. If a request falls back to normal decode, its DFlash state is dropped because normal decode does not capture the hidden context needed by the drafter.
+- Per-request acceptance statistics disable DFlash once a request has accumulated enough rejected draft tokens, so prompts the drafter predicts poorly return to baseline decode.
+- DFlash reserves memory for draft weights, draft KV/cache, verify buffers, and batch scratch before target KV sizing. Admission also reserves draft block headroom, so a near-window request accepted without DFlash can be rejected when DFlash is enabled.
+
+## Validation
+
+Commands below passed on an RTX 5090 validation host with driver `580.105.08`, CUDA 13.3, Triton Python `3.7.1`, and `OPENINFER_CUDA_SM=120`. The source snapshot is the PR branch after the benchmark-shaped gate cleanup.
+
+```bash
+cargo fmt --all --check
+git diff --check
+OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
+  OPENINFER_DFLASH_TEST_MODEL_PATH=<Qwen3.5-4B-DFlash> \
+  cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
+  --test dflash_speculative_gate -- --nocapture --test-threads=1
+OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
+  OPENINFER_DFLASH_TEST_MODEL_PATH=<Qwen3.5-4B-DFlash> \
+  cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
+  --test speculative_verify -- --nocapture --test-threads=1
+OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
+  cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
+  --test hf_golden_gate -- --nocapture
+OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
+  cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
+  --test e2e_scheduler -- --nocapture
+```
+
+The DFlash scheduler gates cover single-request, multi-active batch, heterogeneous `max_tokens`, and mixed concurrent request cases, plus the benchmark-shaped synthetic cases that exposed hash differences in the raw sweep (`1024/c16`, `4096/c8`, `4096/c16`). The benchmark-shaped follow-up passed: `1024/c16` was exact for 16/16 requests; `4096/c8` and `4096/c16` were exact except for near-ties accepted by the regret oracle (`regret 0.000` and `0.125 <= 0.20`).
+
+## Benchmark
+
+Same host, same PR branch snapshot, in-process `bench_serving request`, greedy synthetic distinct prompts, `output_len=256`, warmup `3`, iters `8`.
+
+| Prompt | Concurrency | Baseline tok/s | DFlash tok/s | Delta | Baseline effective TPOT p50 | DFlash effective TPOT p50 | Baseline raw ITL p99 | DFlash raw ITL p99 | Baseline TTFT p50 | DFlash TTFT p50 |
+| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| 1 | 1 | 151.214 | 149.808 | -0.9% | 6.593 ms | 6.645 ms | 6.682 ms | 6.699 ms | 9.122 ms | 9.374 ms |
+| 1 | 4 | 110.906 | 129.388 | +16.7% | 8.907 ms | 8.682 ms | 8.988 ms | 21.412 ms | 39.045 ms | 32.756 ms |
+| 1 | 8 | 89.977 | 103.856 | +15.4% | 10.889 ms | 9.679 ms | 10.969 ms | 33.776 ms | 69.832 ms | 63.528 ms |
+| 1 | 16 | 64.930 | 73.990 | +14.0% | 14.925 ms | 13.851 ms | 15.054 ms | 57.610 ms | 131.570 ms | 125.421 ms |
+| 1024 | 1 | 135.543 | 134.699 | -0.6% | 7.220 ms | 7.270 ms | 7.295 ms | 7.297 ms | 46.715 ms | 46.797 ms |
+| 1024 | 4 | 97.293 | 301.482 | +209.9% | 9.911 ms | 2.980 ms | 9.695 ms | 19.231 ms | 153.577 ms | 137.400 ms |
+| 1024 | 8 | 76.916 | 206.606 | +168.6% | 12.217 ms | 4.062 ms | 54.032 ms | 27.162 ms | 263.906 ms | 229.206 ms |
+| 1024 | 16 | 53.404 | 77.602 | +45.3% | 16.963 ms | 11.597 ms | 60.353 ms | 52.408 ms | 492.746 ms | 414.550 ms |
+| 4096 | 1 | 102.477 | 101.745 | -0.7% | 9.039 ms | 9.122 ms | 9.139 ms | 9.134 ms | 189.535 ms | 189.955 ms |
+| 4096 | 4 | 68.916 | 162.581 | +135.9% | 12.830 ms | 4.635 ms | 57.351 ms | 22.665 ms | 640.507 ms | 567.550 ms |
+| 4096 | 8 | 50.473 | 68.502 | +35.7% | 16.238 ms | 11.677 ms | 61.075 ms | 35.653 ms | 1106.224 ms | 941.275 ms |
+| 4096 | 16 | 32.875 | 41.304 | +25.6% | 23.239 ms | 17.710 ms | 65.017 ms | 63.604 ms | 2070.313 ms | 1696.621 ms |
+
+`effective_tpot_ms` is the per-request decode time amortized over output tokens. Raw token-event ITL can spike under speculative decode because an accepted span emits multiple tokens in one scheduler step, so the gap before that step is attributed to a single token event; keep both metrics visible when reviewing tails.
+
+## Profile
+
+Profiles used `nsys profile --trace=cuda,nvtx,osrt --cuda-graph-trace=node` and `nsys stats` on the same host. The final c8/c16 traces show that the previous per-request verifier bottleneck is gone: DFlash uses batched prefill verify kernels and partial-only replay instead of singleton target-prefill verification.
+
+| Shape | Baseline dominant work | DFlash dominant work | Profile conclusion |
+| --- | --- | --- | --- |
+| `prompt=1,c=8` | `gated_delta_rule_decode_kernel` `2.04s`, batch decode attention `72.6ms` | GDR verify kernels plus lower target decode counts; batch decode attention `71.2ms` | Draft/verify overhead is below the throughput saved by multi-token accepts. |
+| `prompt=1024,c=8` | `gated_delta_rule_decode_kernel` `2.06s`, batch decode attention `550.2ms` | GDR verify kernels, `SinglePrefillWithKVCacheKernel` `75.3ms`, batch prefill verify `49.6ms`, batch decode attention `71.8ms` | Verifier no longer runs target prefill per request; c8 decode throughput improves `+192.90%` (the benchmark table above reports `+168.6%` tok/s for this shape). |
+| `prompt=4096,c=16` | `gated_delta_rule_decode_kernel` `4.41s`, batch decode attention `2.44s`, batch prefill `398.5ms` | batch decode attention `1.68s`, batch prefill verify `537.9ms`, GDR verify kernels visible but not dominant | Commit/replay/copy is not the leading bottleneck; long c16 still improves `+24.89%` (the benchmark table above reports `+25.6%` tok/s for this shape). |
+
+## Claim Boundaries
+
+- This is an opt-in Qwen3.5 DFlash path with real c4/c8/c16 in-process benchmark wins. Token sanity is exact where stable; prompt-length-1 and a few long high-concurrency synthetic cases are covered by the same regret oracle used by the scheduler gate for bf16 near-tie / prefill-vs-decode boundary flips.
+- The performance table is in-process benchmark evidence. Do not read it as an HTTP serving pressure claim.
+- Single-concurrency random synthetic prompts remain flat to slightly slower. The multi-active path is the supported performance claim for this slice.
+- Multi-GPU, LoRA, KV offload, decode overlap, non-greedy sampling, and logprobs intentionally use normal decode or fail closed.
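
The commit rule stated in the Runtime Contract of the new DFlash doc (longest greedy-matching prefix plus one bonus token) can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the PR's verifier: `commit_tokens`, `draft`, and `target_greedy` are hypothetical names, and the sketch assumes the target produces one greedy token for each position of the verified span, so `target_greedy` has one more entry than `draft`.

```rust
/// Hypothetical helper (not part of the PR): pick the tokens to commit
/// after one speculative verify step under greedy decoding.
///
/// `draft`         - tokens proposed by the drafter after the current token.
/// `target_greedy` - the target model's greedy token at each span position;
///                   entry `i` is what the target would emit after the
///                   current token followed by `draft[..i]`.
pub fn commit_tokens(draft: &[u32], target_greedy: &[u32]) -> Vec<u32> {
    assert_eq!(
        target_greedy.len(),
        draft.len() + 1,
        "target must score every span position"
    );

    // Longest prefix where the draft agrees with the target's greedy choice.
    let accepted = draft
        .iter()
        .zip(target_greedy)
        .take_while(|(d, t)| d == t)
        .count();

    // Commit the accepted draft tokens, then the target's own token at the
    // first unverified position (the "bonus" token).
    let mut committed = draft[..accepted].to_vec();
    committed.push(target_greedy[accepted]);
    committed
}
```

With `draft = [5, 6, 7]` and `target_greedy = [5, 6, 9, 1]`, the first two draft tokens match, so the step commits `[5, 6, 9]`. Every step commits at least one token, so under these assumptions the output is the same sequence that plain greedy decode would produce.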
diff --git a/docs/models/qwen35/roadmap.md b/docs/models/qwen35/roadmap.md
@@ -1,8 +1,8 @@
 # Qwen3.5-4B Roadmap
 
-> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, prefix-cache design, and the serving-level concurrency gap.
+> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: DFlash speculative decoding is now opt-in, default-off, and batched for multi-active decode. Same-host RTX 5090 A/B shows c4/c8/c16 gains for decode-heavy, medium, and long prompts; token sanity is exact where stable and covered by the DFlash regret oracle for bf16 near-tie cases. Remaining structural items are HND prefill staging, prefix-cache design, serving-level HTTP pressure validation for DFlash, and broader non-greedy/TP feature coverage.
 >
-> **Last touched:** 2026-06
+> **Last touched:** 2026-07
 
 Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc: `docs/models/qwen3/roadmap.md` — batched sampling is shared and #284 now routes Qwen3.5 decode through the same compact batched sampler; Qwen3.5 now has its own model-level non-greedy behavior gate, while qwen3 keeps the sibling gate on its side.
 
@@ -20,6 +20,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:
 | Admission | ✓ existing full-lifetime KV admission and explicit `Rejected` events cover impossible KV requests; #253 adds the context-window rejection reason before prefill/decode | `scheduler.rs`, `src/scheduler/plan.rs`, `docs/models/qwen35/kv-admission.md` |
 | Scheduler tests | Partial: current plan selection, full-lifetime admission, context-window rejection, slot assignment, and slot-compaction decisions are CPU-tested; GPU execution remains coupled to the production scheduler | `src/scheduler/plan.rs` |
 | Step tail | Local branch verified: #353 batches the prefill final norm/lm_head tail, samples decode/unified rows from batched logits, and keeps host full-vocab copies only for requested logprobs; HF/e2e gates pass, short-output serving A/B shows TTFT benefit, long-decode TPOT remains a no-claim diagnostic | `docs/models/qwen35/batched-step-tail.md` |
+| DFlash speculative decode | Opt-in batched path: hybrid-state verify/commit covers KV + recurrent + conv state; correctness and scheduler e2e gates pass. Same-host in-process A/B improves `prompt_len=1` c4/c8/c16 by `+16.7%/+15.4%/+14.0%`, `prompt_len=1024` by `+209.9%/+168.6%/+45.3%`, and `prompt_len=4096` by `+135.9%/+35.7%/+25.6%`. | `docs/models/qwen35/dflash-speculative-decoding.md` |
 | TP | ✗ absent (single GPU only) | — |
 | Prefix cache | ✗ absent; recurrent GDR state (~48MB per boundary snapshot) makes "prefix hit" itself a design question | — |
 
@@ -34,10 +35,11 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:
 
 ### Next
 
-5. **Prefill full-paged migration.** Replace the HND staging copy with direct paged writes: removes the ~640MB transient and the extra D2D pass. Chain dependency: paged-direct prefill → per-token position plumbing → RoPE/context-window invariants → opens the door to prefix-cache design.
-6. **Serving-level concurrency profiling.** Add a measured-only server-side range, then split the 1024/256 concurrency-16 gap across scheduler wait, event sync, request dispatch, and model execution. Also teach the Qwen3.5 direct decode bench to prove cached-token exclusion before it reports pure decode TPOT.
-7. **Scheduler logic seam follow-through.** The current admission/slot/compaction decisions have a CPU-tested seam. Keep future admission and rejection changes in that seam instead of re-embedding them in GPU execution.
-8. **Prefix-cache design note.** Linear-attention layers carry recurrent state, not KV blocks — a "prefix hit" must restore both the full-attention KV *and* a recurrent-state snapshot at a block boundary (~48MB per boundary at bf16). Whether to snapshot per block, per N blocks, or only at request end is an open trade; write the design note before any code. Depends on 5.
+5. **DFlash serving pressure validation.** The batched in-process DFlash path is now positive for c4/c8/c16. The next evidence step is HTTP/OpenAI-compatible pressure with the same baseline-vs-DFlash contract, including completed/failed counts, TTFT, effective TPOT, raw ITL, and token hash sanity.
+6. **Prefill full-paged migration.** Replace the HND staging copy with direct paged writes: removes the ~640MB transient and the extra D2D pass. Chain dependency: paged-direct prefill → per-token position plumbing → RoPE/context-window invariants → opens the door to prefix-cache design.
+7. **Serving-level concurrency profiling.** Add a measured-only server-side range, then split the 1024/256 concurrency-16 gap across scheduler wait, event sync, request dispatch, and model execution. Also teach the Qwen3.5 direct decode bench to prove cached-token exclusion before it reports pure decode TPOT.
+8. **Scheduler logic seam follow-through.** The current admission/slot/compaction decisions have a CPU-tested seam. Keep future admission and rejection changes in that seam instead of re-embedding them in GPU execution.
+9. **Prefix-cache design note.** Linear-attention layers carry recurrent state, not KV blocks — a "prefix hit" must restore both the full-attention KV *and* a recurrent-state snapshot at a block boundary (~48MB per boundary at bf16). Whether to snapshot per block, per N blocks, or only at request end is an open trade; write the design note before any code. Depends on 6.
 
 ### Later
 

diff --git a/openinfer-core/src/kv_pool.rs b/openinfer-core/src/kv_pool.rs
@@ -213,6 +213,20 @@ impl KvState {
         self.seq_len += count;
     }
 
+    /// Roll this request's logical KV length back to `token_count`, returning
+    /// any now-unused tail pages to the pool.
+    pub fn truncate_to(&mut self, token_count: usize) -> Result<()> {
+        anyhow::ensure!(
+            token_count <= self.seq_len,
+            "KvState cannot truncate from {} up to {token_count}",
+            self.seq_len
+        );
+        let needed = pages_needed(token_count, self.pool.inner.layout.page_size);
+        self.permit.truncate(needed);
+        self.seq_len = token_count;
+        Ok(())
+    }
+
     /// Build kernel-facing metadata for this request's KV.
     pub fn desc(&self) -> KvDesc<'_> {
         KvDesc {
@@ -348,12 +362,39 @@ mod tests {
         assert_eq!(desc.last_page_len(), 1);
         assert_eq!(pool.available_pages(), 2);
 
+        // Truncate back into the first page: tail page returns immediately.
+        kv.truncate_to(15).unwrap();
+        assert_eq!(kv.seq_len(), 15);
+        let desc = kv.desc();
+        assert_eq!(desc.num_pages(), 1);
+        assert_eq!(desc.last_page_len(), 15);
+        assert_eq!(pool.available_pages(), 3);
+
+        // Truncate to zero releases all request pages.
+        kv.truncate_to(0).unwrap();
+        assert_eq!(kv.seq_len(), 0);
+        assert_eq!(kv.desc().num_pages(), 0);
+        assert_eq!(pool.available_pages(), 4);
+
         // Reset returns all pages
+        kv.ensure_capacity(17).unwrap();
+        kv.advance(17);
         kv.reset();
         assert_eq!(kv.seq_len(), 0);
         assert_eq!(pool.available_pages(), 4);
     }
 
+    #[test]
+    fn kv_state_rejects_truncate_forward() {
+        let pool = test_pool(16, 3);
+        let mut kv = pool.alloc();
+        kv.ensure_capacity(4).unwrap();
+        kv.advance(4);
+
+        let err = kv.truncate_to(5).unwrap_err().to_string();
+        assert!(err.contains("cannot truncate from 4 up to 5"));
+    }
+
     #[test]
     fn kv_state_out_of_pages() {
         // 3 pages total: 1 padding, 2 available → 32 tokens max
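
The page arithmetic behind `truncate_to` in the `kv_pool.rs` hunks above can be modeled without a GPU pool. The sketch below is a toy under stated assumptions: `ToyKv` and its fields are hypothetical stand-ins, a plain free-page counter replaces the real pool permit, and errors are `String` values rather than `anyhow` errors. It mirrors the sequence the new unit test walks through (17 tokens, then truncate to 15, then to 0, with 16-token pages).

```rust
/// Toy stand-in for a paged KV allocation (hypothetical; not the real `KvState`).
pub struct ToyKv {
    pub page_size: usize,
    pub seq_len: usize,
    pub pages_held: usize,
    pub free_pages: usize,
}

/// Ceiling division: pages required to hold `tokens` tokens.
pub fn pages_needed(tokens: usize, page_size: usize) -> usize {
    (tokens + page_size - 1) / page_size
}

impl ToyKv {
    pub fn new(page_size: usize, free_pages: usize) -> Self {
        ToyKv { page_size, seq_len: 0, pages_held: 0, free_pages }
    }

    /// Grow by `count` tokens, taking pages from the free counter as needed.
    pub fn append(&mut self, count: usize) -> Result<(), String> {
        let needed = pages_needed(self.seq_len + count, self.page_size);
        let extra = needed.saturating_sub(self.pages_held);
        if extra > self.free_pages {
            return Err(format!("out of pages: need {extra}, have {}", self.free_pages));
        }
        self.free_pages -= extra;
        self.pages_held = needed.max(self.pages_held);
        self.seq_len += count;
        Ok(())
    }

    /// Roll the logical length back and return unused tail pages.
    /// Truncating forward is rejected rather than silently growing.
    pub fn truncate_to(&mut self, token_count: usize) -> Result<(), String> {
        if token_count > self.seq_len {
            return Err(format!(
                "cannot truncate from {} up to {token_count}",
                self.seq_len
            ));
        }
        let needed = pages_needed(token_count, self.page_size);
        self.free_pages += self.pages_held - needed;
        self.pages_held = needed;
        self.seq_len = token_count;
        Ok(())
    }
}
```

Starting from `ToyKv::new(16, 4)`, appending 17 tokens holds two pages; truncating to 15 drops the tail page, and truncating to 0 returns every page. A partially filled last page is kept whole, so only pages entirely past the new length are released.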

diff --git a/openinfer-core/src/ops.rs b/openinfer-core/src/ops.rs
@@ -25,7 +25,7 @@ pub use openinfer_kernels::ops::{
     rms_norm_gated_batch_into, rms_norm_into, rms_norm_offset_into, scale_f32_in_place,
     scaled_add_batch_into, scaled_add_rows_indexed_into, scaled_add_rows_into,
     scaled_add_rows_token_range_into, silu_mul_batch, silu_mul_batch_into,
-    single_prefill_nhd_noncausal_into, write_vec_into,
+    single_prefill_nhd_causal_window_into, single_prefill_nhd_noncausal_into, write_vec_into,
 };
 #[cfg(not(feature = "kernel-call-trace"))]
 pub use openinfer_kernels::ops::{