Profiling: report fine-grained TTFT / TPOT / output throughput instead of mixed e2e tok/s

## Summary

`npu_generate.py`'s `--profile` Timing Report collapses prefill and decode into a single mixed **`throughput (e2e)`** number (`total tokens / generate-phase wall-clock`). This metric mixes a one-time latency cost (prefill, including first-dispatch warmup) with steady-state decode, so it both **understates steady-state throughput** and **hides the latency/throughput distinction** that matters for deployment. We should report deployment-realistic, separated metrics: **TTFT / TPOT / output throughput**, with one-time warmup excluded.

## Current behavior

The report prints:

```
[phase] generate (e2e)       :    10.02s   (32 tokens generated)
[phase] throughput (e2e)     :     3.19 tok/s
[api]   run_prefill          :     7.99s   (1 call, TTFT-ish)
[api]   run_decode total     :     1.66s   (31 steps, avg  53.5 ms/step, 18.69 step/s)
  kernel.prefill_fwd  : n=1   avg=1915.74ms
  kernel.decode_layer : n=31  avg=  45.51ms (min 43.73 / max 61.99)
```

`throughput (e2e) = 32 / 10.02s = 3.19 tok/s`.

### Why this is misleading

1. **Prefill warmup is folded in.** `run_prefill` API time is 7.99s but the prefill **kernel** is only ~1.9s — ~6s is one-time first-dispatch host-side warmup (orchestration first launch, runtime warmup). In real deployment this is paid once at startup, not per request.
2. **Latency and throughput are mixed.** Prefill is a per-request *latency* (TTFT) cost; decode is a *throughput* (TPOT) cost. Averaging them into one tok/s is not actionable.
3. **Decode step-0 warmup is included.** Step 0 = 62ms vs steady ~44–45ms.

Steady-state decode is actually **~44.7 ms/step ≈ 22 tok/s**, but the headline number says 3.19 tok/s — a ~7x understatement of generation throughput.

> Note: `init_model` (weight load + kernel compile, ~65s) is already correctly excluded from `throughput (e2e)`. This issue is about the *remaining* one-time costs inside the generate phase.

## Proposed metrics (deployment口径)

Report these separately (mirroring vLLM's TTFT / TPOT / output-throughput convention):

- **TTFT (Time To First Token)** — prefill latency. Report both *first request* (with warmup) and *steady-state* (warmed-up prefill kernel), since they differ by ~6s here.
- **TPOT (Time Per Output Token)** — steady-state decode per-token time, **excluding the step-0 warmup** (and ideally median/p50 + p99, not just mean).
- **Output throughput** — decode-only steady-state `tok/s = 1000 / TPOT_ms`.
- Keep the raw kernel aggregates (prefill_fwd / decode_layer) as-is.
- Drop or clearly relabel the mixed `throughput (e2e)` so it isn't read as the serving throughput.

### Suggested report shape

```
TTFT (first req)        :  7.99 s    (incl. one-time warmup)
TTFT (steady, kernel)   :  1.92 s
TPOT (steady, p50)      :  44.7 ms/token   (excl. step-0 warmup)
Output throughput       :  22.4 tok/s      (decode-only, steady)
```

## Scope

- `examples/model/qwen3_14b/npu_generate.py` — `PrintTimingReport` / the timing collector.
- Exclude the first decode step (and document the choice) when computing TPOT/output-throughput.
- Optionally surface p50/p99 for TPOT.

## Environment

- repo: `pypto-serving` @ `main` (970312c)
- pypto-lib submodule: `30e2cfe`
- pypto: `a8258003`, runtime/simpler: `48980572`
- model: Qwen3-14B, platform a2a3, `PTO2_RING_HEAP=2GiB`, max-seq-len 512, max-new-tokens 32


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Profiling: report fine-grained TTFT / TPOT / output throughput instead of mixed e2e tok/s #28

Summary

Current behavior

Why this is misleading

Proposed metrics (deployment口径)

Suggested report shape

Scope

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Profiling: report fine-grained TTFT / TPOT / output throughput instead of mixed e2e tok/s #28

Description

Summary

Current behavior

Why this is misleading

Proposed metrics (deployment口径)

Suggested report shape

Scope

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions