Skip to content

Profiling: report fine-grained TTFT / TPOT / output throughput instead of mixed e2e tok/s #28

Description

@lyfne123

Summary

npu_generate.py's --profile Timing Report collapses prefill and decode into a single mixed throughput (e2e) number (total tokens / generate-phase wall-clock). This metric mixes a one-time latency cost (prefill, including first-dispatch warmup) with steady-state decode, so it both understates steady-state throughput and hides the latency/throughput distinction that matters for deployment. We should report deployment-realistic, separated metrics: TTFT / TPOT / output throughput, with one-time warmup excluded.

Current behavior

The report prints:

[phase] generate (e2e)       :    10.02s   (32 tokens generated)
[phase] throughput (e2e)     :     3.19 tok/s
[api]   run_prefill          :     7.99s   (1 call, TTFT-ish)
[api]   run_decode total     :     1.66s   (31 steps, avg  53.5 ms/step, 18.69 step/s)
  kernel.prefill_fwd  : n=1   avg=1915.74ms
  kernel.decode_layer : n=31  avg=  45.51ms (min 43.73 / max 61.99)

throughput (e2e) = 32 / 10.02s = 3.19 tok/s.

Why this is misleading

  1. Prefill warmup is folded in. run_prefill API time is 7.99s but the prefill kernel is only ~1.9s — ~6s is one-time first-dispatch host-side warmup (orchestration first launch, runtime warmup). In real deployment this is paid once at startup, not per request.
  2. Latency and throughput are mixed. Prefill is a per-request latency (TTFT) cost; decode is a throughput (TPOT) cost. Averaging them into one tok/s is not actionable.
  3. Decode step-0 warmup is included. Step 0 = 62ms vs steady ~44–45ms.

Steady-state decode is actually ~44.7 ms/step ≈ 22 tok/s, but the headline number says 3.19 tok/s — a ~7x understatement of generation throughput.

Note: init_model (weight load + kernel compile, ~65s) is already correctly excluded from throughput (e2e). This issue is about the remaining one-time costs inside the generate phase.

Proposed metrics (deployment口径)

Report these separately (mirroring vLLM's TTFT / TPOT / output-throughput convention):

  • TTFT (Time To First Token) — prefill latency. Report both first request (with warmup) and steady-state (warmed-up prefill kernel), since they differ by ~6s here.
  • TPOT (Time Per Output Token) — steady-state decode per-token time, excluding the step-0 warmup (and ideally median/p50 + p99, not just mean).
  • Output throughput — decode-only steady-state tok/s = 1000 / TPOT_ms.
  • Keep the raw kernel aggregates (prefill_fwd / decode_layer) as-is.
  • Drop or clearly relabel the mixed throughput (e2e) so it isn't read as the serving throughput.

Suggested report shape

TTFT (first req)        :  7.99 s    (incl. one-time warmup)
TTFT (steady, kernel)   :  1.92 s
TPOT (steady, p50)      :  44.7 ms/token   (excl. step-0 warmup)
Output throughput       :  22.4 tok/s      (decode-only, steady)

Scope

  • examples/model/qwen3_14b/npu_generate.pyPrintTimingReport / the timing collector.
  • Exclude the first decode step (and document the choice) when computing TPOT/output-throughput.
  • Optionally surface p50/p99 for TPOT.

Environment

  • repo: pypto-serving @ main (970312c)
  • pypto-lib submodule: 30e2cfe
  • pypto: a8258003, runtime/simpler: 48980572
  • model: Qwen3-14B, platform a2a3, PTO2_RING_HEAP=2GiB, max-seq-len 512, max-new-tokens 32

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions