Background
PR #507 includes same-host in-process benchmark evidence for Qwen3.5 DFlash and keeps the claim boundary explicit: it is not an HTTP serving pressure claim.
Qwen3 DFlash has serving-level measurement scripts under tools/bench/, including run_serving_bench.sh and qps_sweep.sh. Qwen3.5 should get the same public evidence layer before we make broader OpenAI-compatible serving claims.
Goal
Run and document a Qwen3.5 DFlash HTTP serving A/B sweep using the existing serving benchmark scripts.
Suggested scope
- Launch Qwen3.5 baseline and Qwen3.5 + DFlash through the OpenAI-compatible server path.
- Use greedy requests with temperature=0, ignore_eos, and percentile metrics for ttft,tpot,itl,e2el.
- Cover at least prompt_len=1/1024/4096, output_len=256, and concurrency 1/4/8/16.
- Record completed/failed requests, TTFT, TPOT, ITL p50/p99, E2EL, output tok/s, acceptance length/rate, and token sanity.
- Keep the result separate from in-process benchmark evidence.
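As a rough illustration of the scope above, the sweep can be thought of as a grid of cells, each run once for the baseline and once with DFlash. The sketch below is only an assumption about how one might enumerate that grid and summarize latencies. The variant labels, function names, and request-body layout are hypothetical and are not taken from run_serving_bench.sh or qps_sweep.sh. `ignore_eos` is treated as a server-side extension field, as named in the scope list.

```python
from itertools import product

# Workload shape taken from the scope list; the variant labels are hypothetical.
PROMPT_LENS = (1, 1024, 4096)
OUTPUT_LEN = 256
CONCURRENCIES = (1, 4, 8, 16)
VARIANTS = ("baseline", "dflash")


def sweep_cells():
    """Return one dict per (variant, prompt_len, concurrency) cell."""
    return [
        {"variant": v, "prompt_len": p, "output_len": OUTPUT_LEN, "concurrency": c}
        for v, p, c in product(VARIANTS, PROMPT_LENS, CONCURRENCIES)
    ]


def greedy_request_body(model, prompt, output_len=OUTPUT_LEN):
    """Build a greedy completion request body (field layout is an assumption)."""
    return {
        "model": model,             # placeholder; use the served model name
        "prompt": prompt,
        "max_tokens": output_len,
        "temperature": 0,           # greedy decoding
        "ignore_eos": True,         # force exactly output_len tokens
    }


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Summarize one latency metric (e.g. ITL) as p50/p99."""
    return {"p50": percentile(samples, 50), "p99": percentile(samples, 99)}
```

With these shapes the grid has 2 x 3 x 4 = 24 cells, so a complete A/B table should have 12 rows per variant. Nearest-rank percentiles are used here only for simplicity; the existing scripts may interpolate differently, so published numbers should come from the scripts themselves.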
Validation
- Public benchmark table with commit, GPU model, CUDA/driver versions, workload shape, and pass/fail counts.
- No private hostnames, credentials, or local artifact paths in docs or comments.
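The second validation item could be checked mechanically before publishing. The following is a minimal sketch, assuming a small set of illustrative regular expressions. The pattern list and the `find_leaks` name are hypothetical, are not part of the repository, and would not catch every private hostname or credential format.

```python
import re

# Illustrative patterns only; extend them for the hosts and paths actually in use.
SUSPECT_PATTERNS = {
    "absolute home path": re.compile(r"(?:/home/|/Users/|/root/)\S+"),
    "private IPv4 address": re.compile(
        r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b"
    ),
    "credential-like assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|token|password|secret)\s*[=:]\s*\S+"
    ),
}


def find_leaks(text):
    """Return (line_number, label) pairs for lines matching a suspect pattern."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, label))
    return hits
```

Running `find_leaks` over the draft benchmark table and any accompanying comments gives a quick first pass; a manual read is still needed for anything the patterns do not cover.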
Related: #434, #507.