Summary
npu_generate.py's --profile Timing Report collapses prefill and decode into a single mixed throughput (e2e) number (total tokens / generate-phase wall-clock). This metric mixes a one-time latency cost (prefill, including first-dispatch warmup) with steady-state decode, so it both understates steady-state throughput and hides the latency/throughput distinction that matters for deployment. We should report deployment-realistic, separated metrics: TTFT / TPOT / output throughput, with one-time warmup excluded.
Current behavior
The report prints:
[phase] generate (e2e) : 10.02s (32 tokens generated)
[phase] throughput (e2e) : 3.19 tok/s
[api] run_prefill : 7.99s (1 call, TTFT-ish)
[api] run_decode total : 1.66s (31 steps, avg 53.5 ms/step, 18.69 step/s)
kernel.prefill_fwd : n=1 avg=1915.74ms
kernel.decode_layer : n=31 avg= 45.51ms (min 43.73 / max 61.99)
throughput (e2e) = 32 / 10.02s = 3.19 tok/s.
Why this is misleading
- Prefill warmup is folded in.
run_prefill API time is 7.99s but the prefill kernel is only ~1.9s — ~6s is one-time first-dispatch host-side warmup (orchestration first launch, runtime warmup). In real deployment this is paid once at startup, not per request.
- Latency and throughput are mixed. Prefill is a per-request latency (TTFT) cost; decode is a throughput (TPOT) cost. Averaging them into one tok/s is not actionable.
- Decode step-0 warmup is included. Step 0 = 62ms vs steady ~44–45ms.
Steady-state decode is actually ~44.7 ms/step ≈ 22 tok/s, but the headline number says 3.19 tok/s — a ~7x understatement of generation throughput.
Note: init_model (weight load + kernel compile, ~65s) is already correctly excluded from throughput (e2e). This issue is about the remaining one-time costs inside the generate phase.
Proposed metrics (deployment口径)
Report these separately (mirroring vLLM's TTFT / TPOT / output-throughput convention):
- TTFT (Time To First Token) — prefill latency. Report both first request (with warmup) and steady-state (warmed-up prefill kernel), since they differ by ~6s here.
- TPOT (Time Per Output Token) — steady-state decode per-token time, excluding the step-0 warmup (and ideally median/p50 + p99, not just mean).
- Output throughput — decode-only steady-state
tok/s = 1000 / TPOT_ms.
- Keep the raw kernel aggregates (prefill_fwd / decode_layer) as-is.
- Drop or clearly relabel the mixed
throughput (e2e) so it isn't read as the serving throughput.
Suggested report shape
TTFT (first req) : 7.99 s (incl. one-time warmup)
TTFT (steady, kernel) : 1.92 s
TPOT (steady, p50) : 44.7 ms/token (excl. step-0 warmup)
Output throughput : 22.4 tok/s (decode-only, steady)
Scope
examples/model/qwen3_14b/npu_generate.py — PrintTimingReport / the timing collector.
- Exclude the first decode step (and document the choice) when computing TPOT/output-throughput.
- Optionally surface p50/p99 for TPOT.
Environment
- repo:
pypto-serving @ main (970312c)
- pypto-lib submodule:
30e2cfe
- pypto:
a8258003, runtime/simpler: 48980572
- model: Qwen3-14B, platform a2a3,
PTO2_RING_HEAP=2GiB, max-seq-len 512, max-new-tokens 32
Summary
npu_generate.py's--profileTiming Report collapses prefill and decode into a single mixedthroughput (e2e)number (total tokens / generate-phase wall-clock). This metric mixes a one-time latency cost (prefill, including first-dispatch warmup) with steady-state decode, so it both understates steady-state throughput and hides the latency/throughput distinction that matters for deployment. We should report deployment-realistic, separated metrics: TTFT / TPOT / output throughput, with one-time warmup excluded.Current behavior
The report prints:
throughput (e2e) = 32 / 10.02s = 3.19 tok/s.Why this is misleading
run_prefillAPI time is 7.99s but the prefill kernel is only ~1.9s — ~6s is one-time first-dispatch host-side warmup (orchestration first launch, runtime warmup). In real deployment this is paid once at startup, not per request.Steady-state decode is actually ~44.7 ms/step ≈ 22 tok/s, but the headline number says 3.19 tok/s — a ~7x understatement of generation throughput.
Proposed metrics (deployment口径)
Report these separately (mirroring vLLM's TTFT / TPOT / output-throughput convention):
tok/s = 1000 / TPOT_ms.throughput (e2e)so it isn't read as the serving throughput.Suggested report shape
Scope
examples/model/qwen3_14b/npu_generate.py—PrintTimingReport/ the timing collector.Environment
pypto-serving@main(970312c)30e2cfea8258003, runtime/simpler:48980572PTO2_RING_HEAP=2GiB, max-seq-len 512, max-new-tokens 32