Draft
Changes from 2 commits
3 changes: 2 additions & 1 deletion docs/index.md
@@ -43,9 +43,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l

| Path | TL;DR |
| --- | --- |
| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, prefix-cache design, serving concurrency. |
| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap: DFlash is opt-in and batched for multi-active decode, with RTX 5090 in-process c4/c8/c16 gains across decode-heavy, medium, and long prompts. Open items: HTTP pressure validation for DFlash, HND prefill staging, prefix-cache design, and broader feature coverage. |
| `models/qwen35/kv-admission.md` | Issue #254 complete: Qwen3.5 now uses full-lifetime KV admission, deferred pressure handling, impossible-request rejection, explicit error semantics, direct rejection-event coverage, RTX 5090 e2e, and real HTTP pressure/post-pressure validation. |
| `models/qwen35/optimization.md` | Hybrid 24 linear + 8 full attn optimization ledger. Decode-tuning refresh fuses MLP gate/up and tunes decode cublasLt buckets, improving direct TPOT by 2-3%; vLLM still leads 1024/256 HTTP decode. |
| `models/qwen35/dflash-speculative-decoding.md` | Qwen3.5 DFlash speculative decoding is opt-in and batched: hybrid-state transaction covers KV + recurrent + conv state, correctness gates pass, and RTX 5090 in-process c4/c8/c16 A/B improves `prompt_len=1` by `+16.7%/+15.4%/+14.0%`, `1024` by `+209.9%/+168.6%/+45.3%`, and `4096` by `+135.9%/+35.7%/+25.6%`. |
| `models/qwen35/accuracy.md` | Qwen3.5-4B HF bf16 logits goldens through `past_key_values`: short replay covers sequential graph, bucket-straddling batched graph, and slot-compaction; long replay covers 4097/8192-token prompts; full GSM8K 8-shot now matches the HF baseline within 0.15 percentage points. |
| `models/qwen35/model-crate.md` | `openinfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; feature-gated behind `qwen35-4b` (Triton AOT is the only Python build dependency); root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. |
| `models/qwen35/kernel-plan.md` | Qwen3.5-4B has an `openinfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module. It enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. |
90 changes: 90 additions & 0 deletions docs/models/qwen35/dflash-speculative-decoding.md
@@ -0,0 +1,90 @@
# DFlash Speculative Decoding (Qwen3.5-4B)

> **TL;DR:** Qwen3.5-4B DFlash speculative decoding is implemented behind `--dflash-draft-model-path`, default-off, greedy-only, single-GPU only, and now supports multi-active decode batches with a fixed-buffer batched verifier. Same-host RTX 5090 A/B on `output_len=256` shows clear throughput wins at c4/c8/c16: decode-heavy `prompt_len=1` improves `+16.7%/+15.4%/+14.0%`, medium `prompt_len=1024` improves `+209.9%/+168.6%/+45.3%`, and long `prompt_len=4096` improves `+135.9%/+35.7%/+25.6%`.

Last touched: 2026-07

## How To Enable

Use a Qwen3.5 target model with a matching DFlash draft checkpoint:

```bash
cargo run --release -p openinfer-server --features qwen35-4b -- \
--model-path <Qwen3.5-4B> \
--dflash-draft-model-path <Qwen3.5-4B-DFlash>
```

The flag is rejected for unsupported model lines. Qwen3.5 DFlash is incompatible with LoRA, KV offload, tensor parallelism, and decode-overlap modes. Non-greedy requests and requests that ask for logprobs use normal decode.

## Runtime Contract

- The drafter emits `[current_token, draft...]`; the target verifies that span and commits the longest greedy-matching prefix plus one bonus token.
- Verification uses preallocated `VerifyBuffers35` storage for token ids, hidden/logit buffers, GDR scratch, full-attention scratch, paged prefill plans, and sampling scratch. Decode steps reuse fixed buffers instead of allocating on the hot path.
- Qwen3.5 verification is a hybrid transaction over full-attention KV, recurrent state, convolution state, and sequence length. Verify writes to scratch state. On commit, a fully accepted span is kept as written; a truncated accept rolls KV back to the canonical boundary and replays only the accepted prefix.
- Batched verify handles active batches up to the scheduler bucket size. Complete fixed shapes can use captured graph-compatible paths; truncated or heterogeneous spans use eager verify.
- The scheduler captures target hidden context only on DFlash-eligible prefill paths. If a request falls back to normal decode, its DFlash state is dropped because normal decode does not capture the hidden context needed by the drafter.
- DFlash tracks draft acceptance per request and disables itself for a request after enough poor draft tokens, so prompts the drafter handles badly return to baseline decode.
- DFlash reserves memory for draft weights, draft KV/cache, verify buffers, and batch scratch before target KV sizing. Admission also reserves draft block headroom, so a request close to the context window that is admitted without DFlash can be rejected when DFlash is enabled.
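
The accept rule in the first bullet (longest greedy-matching prefix plus one bonus token) can be sketched as a standalone function. This is a minimal illustration, not the crate's verifier: `accept_greedy`, `draft`, and `target_argmax` are hypothetical names, and the sketch assumes the target's greedy choice is already available for every span position.

```rust
/// Decide which tokens to commit after one verify step (greedy only).
///
/// `draft` holds the drafter's proposed tokens (the part after
/// `current_token`). `target_argmax[i]` is the target model's greedy choice
/// for the token that follows span position `i`, so it has one more entry
/// than `draft`. Both names are illustrative, not the crate's API.
fn accept_greedy(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);

    // Length of the longest prefix where the draft agrees with the target.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();

    // Commit the agreed prefix, then the target's own next token: the
    // correction on a mismatch, or the extra token after a full accept.
    let mut committed = draft[..accepted].to_vec();
    committed.push(target_argmax[accepted]);
    committed
}
```

With a draft of `[5, 6, 7]` and target choices `[5, 6, 9, 1]`, the sketch commits `[5, 6, 9]`: two accepted draft tokens plus the target's token at the first disagreement. Every step therefore commits at least one token, which is why a poor draft costs verify time but never stalls the request.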

## Validation

Commands below passed on an RTX 5090 validation host with driver `580.105.08`, CUDA 13.3, Triton Python `3.7.1`, and `OPENINFER_CUDA_SM=120`. The source snapshot is the PR branch after the benchmark-shaped gate cleanup.

```bash
cargo fmt --all --check
git diff --check
OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
OPENINFER_DFLASH_TEST_MODEL_PATH=<Qwen3.5-4B-DFlash> \
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
--test dflash_speculative_gate -- --nocapture --test-threads=1
OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
OPENINFER_DFLASH_TEST_MODEL_PATH=<Qwen3.5-4B-DFlash> \
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
--test speculative_verify -- --nocapture --test-threads=1
OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
--test hf_golden_gate -- --nocapture
OPENINFER_TRITON_PYTHON=<triton-python> OPENINFER_TEST_MODEL_PATH=<Qwen3.5-4B> \
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b \
--test e2e_scheduler -- --nocapture
```

The DFlash scheduler gates cover a single request, a multi-active batch, heterogeneous `max_tokens`, mixed concurrent requests, and the benchmark-shaped synthetic cases that exposed hash differences in the raw sweep (`1024/c16`, `4096/c8`, `4096/c16`). The benchmark-shaped follow-up passed: `1024/c16` was exact for 16/16 requests; `4096/c8` and `4096/c16` were exact except for near-ties accepted by the regret oracle (`regret 0.000` and `0.125 <= 0.20`).

## Benchmark

Same host, same PR branch snapshot, in-process `bench_serving request`, greedy synthetic distinct prompts, `output_len=256`, warmup `3`, iters `8`.

| Prompt | Concurrency | Baseline tok/s | DFlash tok/s | Delta | Baseline effective TPOT p50 | DFlash effective TPOT p50 | Baseline raw ITL p99 | DFlash raw ITL p99 | Baseline TTFT p50 | DFlash TTFT p50 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 1 | 151.214 | 149.808 | -0.9% | 6.593 ms | 6.645 ms | 6.682 ms | 6.699 ms | 9.122 ms | 9.374 ms |
| 1 | 4 | 110.906 | 129.388 | +16.7% | 8.907 ms | 8.682 ms | 8.988 ms | 21.412 ms | 39.045 ms | 32.756 ms |
| 1 | 8 | 89.977 | 103.856 | +15.4% | 10.889 ms | 9.679 ms | 10.969 ms | 33.776 ms | 69.832 ms | 63.528 ms |
| 1 | 16 | 64.930 | 73.990 | +14.0% | 14.925 ms | 13.851 ms | 15.054 ms | 57.610 ms | 131.570 ms | 125.421 ms |
| 1024 | 1 | 135.543 | 134.699 | -0.6% | 7.220 ms | 7.270 ms | 7.295 ms | 7.297 ms | 46.715 ms | 46.797 ms |
| 1024 | 4 | 97.293 | 301.482 | +209.9% | 9.911 ms | 2.980 ms | 9.695 ms | 19.231 ms | 153.577 ms | 137.400 ms |
| 1024 | 8 | 76.916 | 206.606 | +168.6% | 12.217 ms | 4.062 ms | 54.032 ms | 27.162 ms | 263.906 ms | 229.206 ms |
| 1024 | 16 | 53.404 | 77.602 | +45.3% | 16.963 ms | 11.597 ms | 60.353 ms | 52.408 ms | 492.746 ms | 414.550 ms |
| 4096 | 1 | 102.477 | 101.745 | -0.7% | 9.039 ms | 9.122 ms | 9.139 ms | 9.134 ms | 189.535 ms | 189.955 ms |
| 4096 | 4 | 68.916 | 162.581 | +135.9% | 12.830 ms | 4.635 ms | 57.351 ms | 22.665 ms | 640.507 ms | 567.550 ms |
| 4096 | 8 | 50.473 | 68.502 | +35.7% | 16.238 ms | 11.677 ms | 61.075 ms | 35.653 ms | 1106.224 ms | 941.275 ms |
| 4096 | 16 | 32.875 | 41.304 | +25.6% | 23.239 ms | 17.710 ms | 65.017 ms | 63.604 ms | 2070.313 ms | 1696.621 ms |

`effective_tpot_ms` is the amortized per-request decode time. Raw token-event ITL can spike under speculative decode because accepted spans emit multiple tokens in one scheduler step; keep both metrics visible when reviewing tails.
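
To see why the two metrics diverge, here is a minimal sketch that computes both from per-step emission events. The definitions are assumptions for illustration (amortized time per token after the first, and the gap between consecutive token events); `StepEvent` and both functions are hypothetical and may not match how `bench_serving` computes its columns.

```rust
/// One scheduler step for a single request: when it finished and how many
/// tokens it emitted. Hypothetical type, for illustration only.
#[derive(Clone, Copy)]
struct StepEvent {
    t_ms: f64,
    tokens: usize,
}

/// Amortized decode time per token: elapsed time after the first event
/// divided by the number of tokens emitted after the first one.
fn effective_tpot_ms(events: &[StepEvent]) -> Option<f64> {
    let first = events.first()?;
    let last = events.last()?;
    let total: usize = events.iter().map(|e| e.tokens).sum();
    if total < 2 {
        return None;
    }
    Some((last.t_ms - first.t_ms) / (total - 1) as f64)
}

/// Raw inter-token gaps: tokens emitted in the same step share a timestamp,
/// so the first token of a step carries the whole step gap and the rest of
/// that step's tokens see a gap of zero.
fn raw_itl_ms(events: &[StepEvent]) -> Vec<f64> {
    let mut gaps = Vec::new();
    let mut prev: Option<f64> = None;
    for e in events {
        for _ in 0..e.tokens {
            if let Some(p) = prev {
                gaps.push(e.t_ms - p);
            }
            prev = Some(e.t_ms);
        }
    }
    gaps
}
```

For a made-up speculative trace with events at 0 ms (1 token), 20 ms (4 tokens), and 40 ms (4 tokens), the amortized figure is 5 ms per token while the largest raw gap is 20 ms. A baseline trace that emits one token every 9 ms has both values at 9 ms. That is the pattern in the table: lower effective TPOT alongside a higher raw ITL tail.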

## Profile

Profiles used `nsys profile --trace=cuda,nvtx,osrt --cuda-graph-trace=node` and `nsys stats` on the same host. The final c8/c16 traces show that the previous per-request verifier bottleneck is gone: DFlash uses batched prefill verify kernels and partial-only replay instead of singleton target-prefill verification.

| Shape | Baseline dominant work | DFlash dominant work | Profile conclusion |
| --- | --- | --- | --- |
| `prompt=1,c=8` | `gated_delta_rule_decode_kernel` `2.04s`, batch decode attention `72.6ms` | GDR verify kernels plus lower target decode counts; batch decode attention `71.2ms` | Draft/verify overhead is below the throughput saved by multi-token accepts. |
| `prompt=1024,c=8` | `gated_delta_rule_decode_kernel` `2.06s`, batch decode attention `550.2ms` | GDR verify kernels, `SinglePrefillWithKVCacheKernel` `75.3ms`, batch prefill verify `49.6ms`, batch decode attention `71.8ms` | Verifier no longer runs target prefill per request; c8 decode throughput improves `+192.90%`. |
| `prompt=4096,c=16` | `gated_delta_rule_decode_kernel` `4.41s`, batch decode attention `2.44s`, batch prefill `398.5ms` | batch decode attention `1.68s`, batch prefill verify `537.9ms`, GDR verify kernels visible but not dominant | Commit/replay/copy is not the leading bottleneck; long c16 still improves `+24.89%`. |

## Claim Boundaries

- This is an opt-in Qwen3.5 DFlash path with real c4/c8/c16 in-process benchmark wins. Token sanity is exact where stable; prompt-length-1 and a few long high-concurrency synthetic cases are covered by the same regret oracle used by the scheduler gate for bf16 near-tie / prefill-vs-decode boundary flips.
- The performance table is in-process benchmark evidence. Do not read it as an HTTP serving pressure claim.
- Single-concurrency random synthetic prompts remain flat to slightly slower (`-0.6%` to `-0.9%` at c1 in the benchmark table). The multi-active path is the supported performance claim for this slice.
- Multi-GPU, LoRA, KV offload, decode overlap, non-greedy sampling, and logprobs intentionally use normal decode or fail closed.
14 changes: 8 additions & 6 deletions docs/models/qwen35/roadmap.md
@@ -1,8 +1,8 @@
# Qwen3.5-4B Roadmap

> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, prefix-cache design, and the serving-level concurrency gap.
> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: DFlash speculative decoding is now opt-in, default-off, and batched for multi-active decode. Same-host RTX 5090 A/B shows c4/c8/c16 gains for decode-heavy, medium, and long prompts; token sanity is exact where stable and covered by the DFlash regret oracle for bf16 near-tie cases. Remaining structural items are HND prefill staging, prefix-cache design, serving-level HTTP pressure validation for DFlash, and broader non-greedy/TP feature coverage.
>
> **Last touched:** 2026-06
> **Last touched:** 2026-07
Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc: `docs/models/qwen3/roadmap.md` — batched sampling is shared and #284 now routes Qwen3.5 decode through the same compact batched sampler; Qwen3.5 now has its own model-level non-greedy behavior gate, while qwen3 keeps the sibling gate on its side.

@@ -20,6 +20,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:
| Admission | ✓ existing full-lifetime KV admission and explicit `Rejected` events cover impossible KV requests; #253 adds the context-window rejection reason before prefill/decode | `scheduler.rs`, `src/scheduler/plan.rs`, `docs/models/qwen35/kv-admission.md` |
| Scheduler tests | Partial: current plan selection, full-lifetime admission, context-window rejection, slot assignment, and slot-compaction decisions are CPU-tested; GPU execution remains coupled to the production scheduler | `src/scheduler/plan.rs` |
| Step tail | Local branch verified: #353 batches the prefill final norm/lm_head tail, samples decode/unified rows from batched logits, and keeps host full-vocab copies only for requested logprobs; HF/e2e gates pass, short-output serving A/B shows TTFT benefit, long-decode TPOT remains a no-claim diagnostic | `docs/models/qwen35/batched-step-tail.md` |
| DFlash speculative decode | Opt-in batched path: hybrid-state verify/commit covers KV + recurrent + conv state; correctness and scheduler e2e gates pass. Same-host in-process A/B improves `prompt_len=1` c4/c8/c16 by `+16.7%/+15.4%/+14.0%`, `prompt_len=1024` by `+209.9%/+168.6%/+45.3%`, and `prompt_len=4096` by `+135.9%/+35.7%/+25.6%`. | `docs/models/qwen35/dflash-speculative-decoding.md` |
| TP | ✗ absent (single GPU only) ||
| Prefix cache | ✗ absent; recurrent GDR state (~48MB per boundary snapshot) makes "prefix hit" itself a design question ||

@@ -34,10 +35,11 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:

### Next

5. **Prefill full-paged migration.** Replace the HND staging copy with direct paged writes: removes the ~640MB transient and the extra D2D pass. Chain dependency: paged-direct prefill → per-token position plumbing → RoPE/context-window invariants → opens the door to prefix-cache design.
6. **Serving-level concurrency profiling.** Add a measured-only server-side range, then split the 1024/256 concurrency-16 gap across scheduler wait, event sync, request dispatch, and model execution. Also teach the Qwen3.5 direct decode bench to prove cached-token exclusion before it reports pure decode TPOT.
7. **Scheduler logic seam follow-through.** The current admission/slot/compaction decisions have a CPU-tested seam. Keep future admission and rejection changes in that seam instead of re-embedding them in GPU execution.
8. **Prefix-cache design note.** Linear-attention layers carry recurrent state, not KV blocks — a "prefix hit" must restore both the full-attention KV *and* a recurrent-state snapshot at a block boundary (~48MB per boundary at bf16). Whether to snapshot per block, per N blocks, or only at request end is an open trade; write the design note before any code. Depends on 5.
5. **DFlash serving pressure validation.** The batched in-process DFlash path is now positive for c4/c8/c16. The next evidence step is HTTP/OpenAI-compatible pressure with the same baseline-vs-DFlash contract, including completed/failed counts, TTFT, effective TPOT, raw ITL, and token hash sanity.
6. **Prefill full-paged migration.** Replace the HND staging copy with direct paged writes: removes the ~640MB transient and the extra D2D pass. Chain dependency: paged-direct prefill → per-token position plumbing → RoPE/context-window invariants → opens the door to prefix-cache design.
7. **Serving-level concurrency profiling.** Add a measured-only server-side range, then split the 1024/256 concurrency-16 gap across scheduler wait, event sync, request dispatch, and model execution. Also teach the Qwen3.5 direct decode bench to prove cached-token exclusion before it reports pure decode TPOT.
8. **Scheduler logic seam follow-through.** The current admission/slot/compaction decisions have a CPU-tested seam. Keep future admission and rejection changes in that seam instead of re-embedding them in GPU execution.
9. **Prefix-cache design note.** Linear-attention layers carry recurrent state, not KV blocks — a "prefix hit" must restore both the full-attention KV *and* a recurrent-state snapshot at a block boundary (~48MB per boundary at bf16). Whether to snapshot per block, per N blocks, or only at request end is an open trade; write the design note before any code. Depends on 6.

### Later

41 changes: 41 additions & 0 deletions openinfer-core/src/kv_pool.rs
@@ -213,6 +213,20 @@ impl KvState {
self.seq_len += count;
}

/// Roll this request's logical KV length back to `token_count`, returning
/// any now-unused tail pages to the pool.
pub fn truncate_to(&mut self, token_count: usize) -> Result<()> {
anyhow::ensure!(
token_count <= self.seq_len,
"KvState cannot truncate from {} up to {token_count}",
self.seq_len
);
let needed = pages_needed(token_count, self.pool.inner.layout.page_size);
self.permit.truncate(needed);
self.seq_len = token_count;
Ok(())
}

/// Build kernel-facing metadata for this request's KV.
pub fn desc(&self) -> KvDesc<'_> {
KvDesc {
@@ -348,12 +362,39 @@ mod tests {
assert_eq!(desc.last_page_len(), 1);
assert_eq!(pool.available_pages(), 2);

// Truncate back into the first page: tail page returns immediately.
kv.truncate_to(15).unwrap();
assert_eq!(kv.seq_len(), 15);
let desc = kv.desc();
assert_eq!(desc.num_pages(), 1);
assert_eq!(desc.last_page_len(), 15);
assert_eq!(pool.available_pages(), 3);

// Truncate to zero releases all request pages.
kv.truncate_to(0).unwrap();
assert_eq!(kv.seq_len(), 0);
assert_eq!(kv.desc().num_pages(), 0);
assert_eq!(pool.available_pages(), 4);

// Reset returns all pages
kv.ensure_capacity(17).unwrap();
kv.advance(17);
kv.reset();
assert_eq!(kv.seq_len(), 0);
assert_eq!(pool.available_pages(), 4);
}

#[test]
fn kv_state_rejects_truncate_forward() {
let pool = test_pool(16, 3);
let mut kv = pool.alloc();
kv.ensure_capacity(4).unwrap();
kv.advance(4);

let err = kv.truncate_to(5).unwrap_err().to_string();
assert!(err.contains("cannot truncate from 4 up to 5"));
}

#[test]
fn kv_state_out_of_pages() {
// 3 pages total: 1 padding, 2 available → 32 tokens max
2 changes: 1 addition & 1 deletion openinfer-core/src/ops.rs
@@ -25,7 +25,7 @@ pub use openinfer_kernels::ops::{
rms_norm_gated_batch_into, rms_norm_into, rms_norm_offset_into, scale_f32_in_place,
scaled_add_batch_into, scaled_add_rows_indexed_into, scaled_add_rows_into,
scaled_add_rows_token_range_into, silu_mul_batch, silu_mul_batch_into,
single_prefill_nhd_noncausal_into, write_vec_into,
single_prefill_nhd_causal_window_into, single_prefill_nhd_noncausal_into, write_vec_into,
};
#[cfg(not(feature = "kernel-call-trace"))]
pub use openinfer_kernels::ops::{