openinfer-project · xiaguan · Jun 14, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/docs/index.md b/docs/index.md
@@ -30,6 +30,8 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. |
 | `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. |
 | `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `openinfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
+| `models/qwen3/dflash-model-download.md` | Download record for `z-lab/Qwen3-4B-DFlash-b16`: local `/data/models/Qwen3-4B-DFlash-b16` and 5090 `/data/Qwen3-4B-DFlash-b16`, the drafter artifact for Qwen3-4B speculative decoding bring-up. |
+| `models/qwen3/dflash-speculative-decoding.md` | Qwen3-4B DFlash is wired for greedy TP1 serving with native drafter/verifier/scheduler, INFO acceptance logs with `committed_tokens`, verifier-span/full-draft HF regret gate, config-derived DFlash reserve, side-state byte-budget admission, transactional speculative KV rollback, per-request prefill capture, chunked-prefill continuity checks, and DFlash small-N cublasLt tuning. Draft PR #380. Latest local 5070 Ti PR-head results: Spec-Bench bs=1 149.32 tok/s (1.66x), Spec-Bench c4 330.42 tok/s (1.09x), random 1024/128 c4 349.50 tok/s (1.29x); post-hardening Spec-Bench c4 n=12 smoke completed 12/12 at 368.71 tok/s with four concurrent DFlash requests logged in one wave; 5090 OpenInfer Spec-Bench bs=1 is 251.48 tok/s and vLLM 0.22.1 DFlash reaches 289.57 tok/s with native acceptance metrics. |
 | `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. |
 | `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. |
 

diff --git a/docs/models/qwen3/dflash-model-download.md b/docs/models/qwen3/dflash-model-download.md
@@ -0,0 +1,72 @@
+# DFlash Model Download
+
+> **TL;DR:** `z-lab/Qwen3-4B-DFlash-b16` is downloaded and verified locally at `/data/models/Qwen3-4B-DFlash-b16` and on the 5090 box at `/data/Qwen3-4B-DFlash-b16`; it is the drafter artifact for Qwen3-4B speculative decoding bring-up.
+>
+> **Last touched:** 2026-06
+
+## Preparation
+
+- **Read**:
+  - `docs/index.md` - Qwen3 model-line docs live under `docs/models/qwen3/`.
+  - `docs/models/qwen3/model-crate.md` - Qwen3 runtime is model-crate owned; local model artifacts under `/data/models` are used by executor/tests through model paths.
+- **Relevant history**:
+  - No existing DFlash or speculative-decoding task doc found.
+- **Plan**:
+  1. Check `/data/models` capacity and whether the target directory already exists.
+  2. Download `z-lab/Qwen3-4B-DFlash-b16` with the Hugging Face Hub CLI into `/data/models/Qwen3-4B-DFlash-b16`.
+  3. List the downloaded files and verify the safetensors/config metadata is present.
+- **Risks / open questions**:
+  - The Hugging Face repo may require auth or may use custom code files that are not part of the plain safetensors load path yet.
+
+## Execution Log
+
+### Step 1: Check destination and capacity
+- `/data/models` already contains the base `/data/models/Qwen3-4B` artifact.
+- `df -h /data/models /data` reported `753G` available on `/data`, enough for the DFlash drafter.
+- Target path chosen: `/data/models/Qwen3-4B-DFlash-b16`.
+
+### Step 2: Download from Hugging Face
+- Command:
+  ```bash
+  uvx --from huggingface_hub hf download z-lab/Qwen3-4B-DFlash-b16 --local-dir /data/models/Qwen3-4B-DFlash-b16
+  ```
+- Result: fetched `9` files, `1.08GB` total, into `/data/models/Qwen3-4B-DFlash-b16`.
+
+### Step 3: Verify local files
+- `find /data/models/Qwen3-4B-DFlash-b16 -maxdepth 2 -type f` shows:
+  - `config.json`
+  - `model.safetensors`
+  - `modeling_dflash.py`
+  - `dflash.py`
+  - `utils.py`
+  - `README.md`
+  - `.gitattributes`
+  - `assets/dflash_system.png`
+  - `assets/speedup.png`
+- `du -sh /data/models/Qwen3-4B-DFlash-b16` reports `1.1G`.
+- `jq` parsed `config.json`; `architectures = ["DFlashDraftModel"]`, `hidden_size = 2560`, `num_hidden_layers = 5`, `vocab_size = 151936`.
+- `safetensors` opened `model.safetensors` successfully and reported `58` tensors.
+
+### Step 4: Place artifact on 5090
+- User explicitly approved placing a copy under `/data` on the 5090 box.
+- Remote path: `/data/Qwen3-4B-DFlash-b16`.
+- Download command used the 5090 proxy from root's `.bashrc`:
+  ```bash
+  export http_proxy=http://172.17.0.1:1081
+  export https_proxy=http://172.17.0.1:1081
+  hf download z-lab/Qwen3-4B-DFlash-b16 --local-dir /data/Qwen3-4B-DFlash-b16 --max-workers 8
+  ```
+- The 5090 copy contains the same core files as the local copy; `model.safetensors` is `1074860568` bytes.
+- Real-weight validation now passes on 5090 with:
+  ```bash
+  OPENINFER_TEST_MODEL_PATH=/data/Qwen3-4B OPENINFER_DFLASH_TEST_MODEL_PATH=/data/Qwen3-4B-DFlash-b16 cargo test --release -p openinfer-qwen3-4b dflash::tests::downloaded_dflash_config_matches_qwen3_4b --lib -- --nocapture
+  ```
+
+## Debrief
+
+- **Outcome**: The DFlash drafter model is present at `/data/models/Qwen3-4B-DFlash-b16` locally and at `/data/Qwen3-4B-DFlash-b16` on the 5090 box.
+- **Pitfalls encountered**:
+  - None during download. The repo includes Python custom-code files, so runtime integration still needs a native Rust loader/forward path rather than relying on `trust_remote_code`.
+- **Lessons learned**:
+  - The artifact is small enough (`1.1G`) to keep alongside the base Qwen3-4B model.
+  - `config.json` does not set `torch_dtype`; integration should read and check the tensor dtype from the safetensors header rather than expecting that config field to be present.
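
The dtype lesson above can be checked without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's `dtype`, `shape`, and `data_offsets`. The following is a minimal sketch using only the Python standard library; the helper names `read_safetensors_header` and `summarize_dtypes` are hypothetical, and the path passed on the command line is whatever local copy is being inspected.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize_dtypes(header):
    """Count tensors per dtype, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python check_dtype.py /data/models/Qwen3-4B-DFlash-b16/model.safetensors
    counts = summarize_dtypes(read_safetensors_header(sys.argv[1]))
    print(sum(counts.values()), "tensors:", dict(counts))
```

Run against `model.safetensors`, the tensor total should agree with the `58` tensors recorded in Step 3, and the per-dtype counts give the value a native loader could assert on instead of `torch_dtype`.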