Closed
1 change: 1 addition & 0 deletions Cargo.lock

2 changes: 2 additions & 0 deletions docs/index.md
@@ -30,6 +30,8 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. |
| `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. |
| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `openinfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
| `models/qwen3/dflash-model-download.md` | Download record for `z-lab/Qwen3-4B-DFlash-b16`: local `/data/models/Qwen3-4B-DFlash-b16` and 5090 `/data/Qwen3-4B-DFlash-b16`, the drafter artifact for Qwen3-4B speculative decoding bring-up. |
| `models/qwen3/dflash-speculative-decoding.md` | Qwen3-4B DFlash is wired for greedy TP1 serving with native drafter/verifier/scheduler, INFO acceptance logs with `committed_tokens`, verifier-span/full-draft HF regret gate, config-derived DFlash reserve, side-state byte-budget admission, transactional speculative KV rollback, per-request prefill capture, chunked-prefill continuity checks, and DFlash small-N cublasLt tuning. Draft PR #380. Latest local 5070 Ti PR-head results: Spec-Bench bs=1 149.32 tok/s (1.66x), Spec-Bench c4 330.42 tok/s (1.09x), random 1024/128 c4 349.50 tok/s (1.29x); post-hardening Spec-Bench c4 n=12 smoke completed 12/12 at 368.71 tok/s with four concurrent DFlash requests logged in one wave; 5090 OpenInfer Spec-Bench bs=1 is 251.48 tok/s and vLLM 0.22.1 DFlash reaches 289.57 tok/s with native acceptance metrics. |
| `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. |
| `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. |

72 changes: 72 additions & 0 deletions docs/models/qwen3/dflash-model-download.md
@@ -0,0 +1,72 @@
# DFlash Model Download

> **TL;DR:** `z-lab/Qwen3-4B-DFlash-b16` is downloaded and verified locally at `/data/models/Qwen3-4B-DFlash-b16` and on the 5090 box at `/data/Qwen3-4B-DFlash-b16`; it is the drafter artifact for Qwen3-4B speculative decoding bring-up.
>
> **Last touched:** 2026-06

## Preparation

- **Read**:
  - `docs/index.md` - Qwen3 model-line docs live under `docs/models/qwen3/`.
  - `docs/models/qwen3/model-crate.md` - Qwen3 runtime is model-crate owned; local model artifacts under `/data/models` are used by executor/tests through model paths.
- **Relevant history**:
  - No existing DFlash or speculative-decoding task doc found.
- **Plan**:
  1. Check `/data/models` capacity and whether the target directory already exists.
  2. Download `z-lab/Qwen3-4B-DFlash-b16` with the Hugging Face Hub CLI into `/data/models/Qwen3-4B-DFlash-b16`.
  3. List the downloaded files and verify the safetensors/config metadata is present.
- **Risks / open questions**:
  - The Hugging Face repo may require auth or may use custom code files that are not part of the plain safetensors load path yet.

## Execution Log

### Step 1: Check destination and capacity
- `/data/models` already contains the base `/data/models/Qwen3-4B` artifact.
- `df -h /data/models /data` reported `753G` available on `/data`, enough for the DFlash drafter.
- Target path chosen: `/data/models/Qwen3-4B-DFlash-b16`.
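
The destination and capacity check in this step can be scripted so it is repeatable on another box. The following is a minimal sketch, not project tooling: the `preflight` function name, its arguments, and the required-space threshold are all assumptions made for illustration.

```bash
# Sketch only: `preflight` is a hypothetical helper, not part of the repository.
# Usage: preflight <parent_dir> <target_dir> <required_kib>
preflight() {
    parent=$1 target=$2 need_kib=$3

    # Refuse to continue if the target already holds files from an earlier download.
    if [ -d "$target" ] && [ -n "$(ls -A "$target")" ]; then
        echo "exists and is not empty: $target" >&2
        return 1
    fi

    # `df -Pk` prints one POSIX-format line per filesystem in 1024-byte blocks;
    # column 4 of the data line is the available space.
    avail_kib=$(df -Pk "$parent" | awk 'NR == 2 { print $4 }')
    if [ "$avail_kib" -lt "$need_kib" ]; then
        echo "insufficient space: ${avail_kib} KiB available, ${need_kib} KiB required" >&2
        return 1
    fi
    echo "ok: ${avail_kib} KiB available under $parent"
}
```

A call such as `preflight /data/models /data/models/Qwen3-4B-DFlash-b16 2097152` would ask for 2 GiB of headroom; that figure is an arbitrary example rather than a measured requirement.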

### Step 2: Download from Hugging Face
- Command:
```bash
uvx --from huggingface_hub hf download z-lab/Qwen3-4B-DFlash-b16 --local-dir /data/models/Qwen3-4B-DFlash-b16
```
- Result: fetched `9` files, `1.08GB` total, into `/data/models/Qwen3-4B-DFlash-b16`.

### Step 3: Verify local files
- `find /data/models/Qwen3-4B-DFlash-b16 -maxdepth 2 -type f` shows:
  - `config.json`
  - `model.safetensors`
  - `modeling_dflash.py`
  - `dflash.py`
  - `utils.py`
  - `README.md`
  - `.gitattributes`
  - `assets/dflash_system.png`
  - `assets/speedup.png`
- `du -sh /data/models/Qwen3-4B-DFlash-b16` reports `1.1G`.
- `jq` parsed `config.json`; `architectures = ["DFlashDraftModel"]`, `hidden_size = 2560`, `num_hidden_layers = 5`, `vocab_size = 151936`.
- `safetensors` opened `model.safetensors` successfully and reported `58` tensors.
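
The file listing above can double as a manifest for later re-checks. This is a hedged sketch: `check_artifact` is a hypothetical helper, and treating an empty file as a failure is an assumption added here to catch truncated downloads, not something the log itself verified.

```bash
# Sketch only: `check_artifact` is a hypothetical helper, not part of the repository.
# The file list is copied from the `find` output recorded in this step.
check_artifact() {
    dir=$1
    missing=0
    for f in config.json model.safetensors modeling_dflash.py dflash.py utils.py \
             README.md .gitattributes assets/dflash_system.png assets/speedup.png; do
        # `-s` is true only for files that exist and are non-empty.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file was found.
    [ "$missing" -eq 0 ]
}
```

Running `check_artifact /data/models/Qwen3-4B-DFlash-b16` exits non-zero and names each absent file on stderr, which makes it usable as a guard before tests that depend on the artifact.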

### Step 4: Place artifact on 5090
- User explicitly approved placing a copy under `/data` on the 5090 box.
- Remote path: `/data/Qwen3-4B-DFlash-b16`.
- Download command used the 5090 proxy from root's `.bashrc`:
```bash
export http_proxy=http://172.17.0.1:1081
export https_proxy=http://172.17.0.1:1081
hf download z-lab/Qwen3-4B-DFlash-b16 --local-dir /data/Qwen3-4B-DFlash-b16 --max-workers 8
```
- The 5090 copy contains the same core files as the local copy; `model.safetensors` is `1074860568` bytes.
- Real-weight validation now passes on 5090 with:
```bash
OPENINFER_TEST_MODEL_PATH=/data/Qwen3-4B OPENINFER_DFLASH_TEST_MODEL_PATH=/data/Qwen3-4B-DFlash-b16 cargo test --release -p openinfer-qwen3-4b dflash::tests::downloaded_dflash_config_matches_qwen3_4b --lib -- --nocapture
```
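
The byte count recorded for `model.safetensors` in this step gives a cheap way to compare copies across machines. A minimal sketch follows, with `check_size` as a hypothetical helper name. A matching size does not prove identical content, so a checksum comparison would be the stronger check.

```bash
# Sketch only: `check_size` is a hypothetical helper, not part of the repository.
# Usage: check_size <file> <expected_bytes>
check_size() {
    file=$1 expected=$2
    # Redirecting into `wc -c` prints the byte count without the file name;
    # `tr` strips the padding some implementations add.
    actual=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual" -ne "$expected" ]; then
        echo "size mismatch for $file: got $actual, expected $expected" >&2
        return 1
    fi
}
```

For the 5090 copy this would be `check_size /data/Qwen3-4B-DFlash-b16/model.safetensors 1074860568`.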

## Debrief

- **Outcome**: The DFlash drafter model is present at `/data/models/Qwen3-4B-DFlash-b16` locally and `/data/Qwen3-4B-DFlash-b16` on 5090.
- **Pitfalls encountered**:
  - None during download. The repo includes Python custom-code files, so runtime integration still needs a native Rust loader/forward path rather than relying on `trust_remote_code`.
- **Lessons learned**:
  - The artifact is small enough (`1.1G`) to keep alongside the base Qwen3-4B model.
  - `config.json` does not set `torch_dtype`; integration should infer/check tensor dtype from safetensors rather than trusting that config field.
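
The dtype lesson can be checked without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that records a `dtype` for every tensor. The sketch below lists the distinct dtypes in that header. `safetensors_dtypes` is a hypothetical helper, and the approach assumes a little-endian host (since `od` reads integers in native byte order) and a header whose `__metadata__` section contains no `dtype` key of its own.

```bash
# Sketch only: `safetensors_dtypes` is a hypothetical helper, not part of the repository.
# Prints each distinct tensor dtype found in a safetensors header, one per line.
safetensors_dtypes() {
    file=$1
    # Bytes 0..7: header length as an unsigned 64-bit integer
    # (little-endian on disk; `od` decodes in host order, so this assumes a little-endian host).
    header_len=$(head -c 8 "$file" | od -An -t u8 | tr -d ' \n')
    # The next header_len bytes are the JSON header; pull out every "dtype" value.
    tail -c +9 "$file" | head -c "$header_len" \
        | grep -aoE '"dtype"[[:space:]]*:[[:space:]]*"[A-Za-z0-9_]+"' \
        | sed -E 's/.*"([A-Za-z0-9_]+)"$/\1/' \
        | sort -u
}
```

A single line of output means all tensors share one dtype, which the loader can compare against the dtype it expects; more than one line signals a mixed-precision file that needs explicit handling.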