qwen35: run DFlash HTTP serving pressure sweep #514

Description

@CAICAIIs

Background

PR #507 includes same-host in-process benchmark evidence for Qwen3.5 DFlash and keeps the claim boundary explicit: it is not an HTTP serving pressure claim.

Qwen3 DFlash has serving-level measurement scripts under tools/bench/, including run_serving_bench.sh and qps_sweep.sh. Qwen3.5 should get the same public evidence layer before we make broader OpenAI-compatible serving claims.

Goal

Run and document a Qwen3.5 DFlash HTTP serving A/B sweep using the existing serving benchmark scripts.

Suggested scope

  • Launch Qwen3.5 baseline and Qwen3.5 + DFlash through the OpenAI-compatible server path.
  • Use greedy requests with temperature=0, ignore_eos, and percentile metrics for ttft,tpot,itl,e2el.
  • Cover at least prompt_len=1/1024/4096, output_len=256, and concurrency 1/4/8/16.
  • Record completed/failed requests, TTFT, TPOT, ITL p50/p99, E2EL, output tok/s, acceptance length/rate, and token sanity.
  • Keep the result separate from in-process benchmark evidence.
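The workload grid in the scope above can be enumerated up front so that every cell is run for both variants. This is a minimal sketch, not the interface of run_serving_bench.sh or qps_sweep.sh: the names `sweep_cases` and `request_body` are hypothetical, and the request field names (`max_tokens`, `ignore_eos`) assume a completions-style server that accepts an `ignore_eos` extension.

```python
from itertools import product

# Workload shape taken from the suggested scope in this issue.
PROMPT_LENS = (1, 1024, 4096)
OUTPUT_LEN = 256
CONCURRENCIES = (1, 4, 8, 16)
VARIANTS = ("baseline", "dflash")  # hypothetical labels for the A/B arms


def sweep_cases():
    """Return one dict per (variant, prompt_len, concurrency) cell."""
    return [
        {
            "variant": variant,
            "prompt_len": prompt_len,
            "output_len": OUTPUT_LEN,
            "concurrency": concurrency,
        }
        for variant, prompt_len, concurrency in product(
            VARIANTS, PROMPT_LENS, CONCURRENCIES
        )
    ]


def request_body(model, prompt, output_len=OUTPUT_LEN):
    """Build a greedy request body.

    Assumption: the server accepts `ignore_eos` as an extra field so that
    every request generates exactly `output_len` tokens.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": output_len,
        "temperature": 0,
        "ignore_eos": True,
    }


if __name__ == "__main__":
    for case in sweep_cases():
        print(case)
```

Running both variants over the same list keeps the A/B comparison cell-for-cell, which makes missing or failed cells easy to spot in the final table.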

Validation

  • Public benchmark table with commit, GPU model, CUDA/driver versions, workload shape, and pass/fail counts.
  • No private hostnames, credentials, or local artifact paths in docs or comments.
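As a sketch of how one row of the public table might be produced from raw per-request samples, the snippet below computes nearest-rank p50/p99 and formats a Markdown row. The helper names, the column order, and the sample values are illustrative assumptions only; the existing scripts may already report percentiles in a different way, and the numbers are not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def table_row(variant, prompt_len, concurrency, completed, failed, ttft_ms):
    """Format one Markdown table row (hypothetical column layout)."""
    return "| {} | {} | {} | {} | {} | {:.1f} | {:.1f} |".format(
        variant,
        prompt_len,
        concurrency,
        completed,
        failed,
        percentile(ttft_ms, 50),
        percentile(ttft_ms, 99),
    )


if __name__ == "__main__":
    # Placeholder samples purely to exercise the formatting.
    placeholder_ttft_ms = [50.0, 10.0, 40.0, 20.0, 30.0]
    print("| variant | prompt_len | concurrency | completed | failed | TTFT p50 (ms) | TTFT p99 (ms) |")
    print("|---|---|---|---|---|---|---|")
    print(table_row("baseline", 1024, 4, 5, 0, placeholder_ttft_ms))
```

Reporting completed and failed counts in the same row as the latency percentiles keeps a cell with dropped requests from looking like a clean result.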

Related: #434, #507.

Metadata

Labels

enhancement (New feature or request), qwen35 (Qwen3.5-4B model crate: pegainfer-qwen35-4b)
