[Profiling Report] Profiling qwen3-14B decode attention

### What was profiled

- **Workload:** qwen3-14B single decode layer (golden unit test), batch=16, varied sequence lengths, fp16/bf16, on a2a3 (910B), passed validation.
- **Tool:** TraCR, the runtime profiler from the `tracr` branch of Simpler, capturing the AICPU scheduler + AICore execution timeline directly on device.
- **Capture:** **6,515 trace events** across **77 channels** — 4 AICPU scheduler threads, 24 AICube cores, 48 AIVector cores — with scheduler markers (`Orchestrating`, `Scheduling`, `Phase1–4`, `Drain`) and per-task execution (`Running_Task_Single/Pair`); kernels identified by task id.

### Key finding: the in-layer attention timeline has far fewer bubbles than the standalone example

Both settings compute the **same** paged-attention GQA, and the attention kernel itself has the **same inherent bubbles** in both. The difference is **what else is on the timeline to hide them**:

- **The decode layer dispatches ~30 kernels** (`q/k/v_proj`, `qk_norm`, `rope_qkv`, `fa_fused`, `online_softmax`, `out_proj`, `gate/up/down_proj`, `silu`, …). Attention is only **2 of them**. The compute-bound projection/MLP matmuls keep the 24 AICube cores saturated, so whenever attention stalls, the scheduler has other independent work running on those cores.

- **Standalone paged attention has nothing to overlap with**, so its inherent stalls are fully exposed as bubbles:
  - *memory-bound* — it streams the KV cache with low arithmetic intensity, so cores wait on HBM;
  - *serial QK → softmax → SV chain* that ping-pongs between AIC (cube matmuls) and AIV (vector softmax), each idling while the other runs;
  - *low decode parallelism* — only batch×kv_heads small work items, not enough to fill 24 cube + 48 vector cores.

- `fa_fused` additionally pipelines internally (`pl.pipeline(stage=2)`, overlapping iteration *i+1*'s QK with iteration *i*'s softmax), shrinking intra-attention gaps; whatever remains is absorbed by the surrounding layer.
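The effect of overlapping iteration *i+1*'s QK with iteration *i*'s softmax can be illustrated with a toy two-resource timeline model. This is only a sketch: the function `simulate`, the unit durations, and the in-order issue model are assumptions made for illustration, and it does not reproduce how `pl.pipeline(stage=2)` is actually implemented.

```python
def simulate(n_iters, t_qk, t_softmax, t_sv, pipelined):
    """Toy model: AIC runs QK and SV in issue order, AIV runs softmax.

    Dependencies per iteration i: QK_i -> softmax_i -> SV_i.
    Returns (makespan, aic_utilization). All durations are hypothetical.
    """
    # Build the AIC issue order.
    if pipelined:
        # Issue QK of iteration i+1 before SV of iteration i.
        order = [("qk", 0)]
        for i in range(n_iters):
            if i + 1 < n_iters:
                order.append(("qk", i + 1))
            order.append(("sv", i))
    else:
        # Fully serial: QK_i, (wait for softmax_i), SV_i, then QK_{i+1}.
        order = [(op, i) for i in range(n_iters) for op in ("qk", "sv")]

    aic_free = 0.0          # time at which the cube core is next available
    aiv_free = 0.0          # time at which the vector core is next available
    softmax_done = {}
    for op, i in order:
        if op == "qk":
            aic_free += t_qk
            # softmax_i starts once QK_i is done and the vector core is free.
            softmax_done[i] = max(aiv_free, aic_free) + t_softmax
            aiv_free = softmax_done[i]
        else:
            # SV_i has to wait for softmax_i.
            aic_free = max(aic_free, softmax_done[i]) + t_sv

    makespan = aic_free
    aic_busy = n_iters * (t_qk + t_sv)
    return makespan, aic_busy / makespan


serial_time, serial_util = simulate(4, 2.0, 2.0, 2.0, pipelined=False)
piped_time, piped_util = simulate(4, 2.0, 2.0, 2.0, pipelined=True)
print(f"serial:    makespan={serial_time}, AIC utilization={serial_util:.2f}")
print(f"pipelined: makespan={piped_time}, AIC utilization={piped_util:.2f}")
```

With equal stage durations the model removes the cube-side gaps entirely; with unequal durations some idle time remains, which corresponds to the residual gaps that the surrounding layer has to absorb.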

### Takeaway

The bubbles are **latency-hidden, not eliminated**. A dense in-layer timeline is the *desired* outcome — it means the scheduler is successfully overlapping attention with the rest of the decode layer, which is what matters for end-to-end throughput. However:

- For **optimizing paged attention itself**, the **standalone example is the better diagnostic** — it exposes the real attention bottleneck (memory-bound + AIC↔AIV serialization) that the full-layer view masks.
- Any **isolated attention latency target** (e.g. the fa_fused + softmax budget) should be measured **standalone**; inside the full layer, part of attention's cost is hidden behind MLP/projection work, so an in-layer reading understates it.
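
The memory-bound claim can be sanity-checked with a rough arithmetic-intensity estimate for GQA decode attention. The sketch below counts only the matmul FLOPs and the K/V bytes read per cached token and ignores Q, output, and softmax traffic. The head configuration (40 query heads, 8 KV heads, head dimension 128) is an assumption taken from the published Qwen3-14B model configuration, not from this trace, and should be checked against the model actually profiled.

```python
def decode_attention_intensity(q_heads, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate FLOPs per byte of KV cache read, per KV head and cached token.

    Assumes one decode query per sequence and fp16/bf16 storage by default.
    """
    group = q_heads // kv_heads               # query heads sharing one KV head
    # QK^T and SV each cost about 2 * head_dim FLOPs per query head and token.
    flops = group * 2 * (2 * head_dim)
    # One K row and one V row are read from the cache per token.
    bytes_read = 2 * head_dim * bytes_per_elem
    return flops / bytes_read


# Assumed Qwen3-14B head configuration (not measured from the trace).
intensity = decode_attention_intensity(q_heads=40, kv_heads=8, head_dim=128)
print(f"approx. arithmetic intensity: {intensity} FLOP/byte")
```

Under these assumptions the intensity equals the GQA group size, a few FLOPs per byte, which is consistent with the cores waiting on HBM rather than on compute.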

### Profile view

<img width="1910" height="931" alt="Image" src="https://github.com/user-attachments/assets/b8d0e4b2-8af7-423a-8c48-17156376ccd0" />

The trace file below can be opened in [Perfetto](https://ui.perfetto.dev/):
[tracr_qwen3_14b_decode.json](https://github.com/user-attachments/files/29298721/tracr_qwen3_14b_decode.json)

### How to recreate the plot

To create these profiles:

- Use the [Simpler with TraCR](https://github.com/huawei-csl/simpler/tree/tracr) branch. [TraCR](https://github.com/huawei-csl/TracR) is a low-level profiler that captures traces on the Ascend device.
- Build Simpler with the `BUILD_TRACR=ON` environment variable set: `BUILD_TRACR=ON pip install --no-build-isolation -e .`.
- Set `_DEFAULT_RUNTIME = "tensormap_and_ringbuffer"` in `pypto/runtime/worker.py`, because TraCR is currently built only on top of this Simpler runtime scheduler.
- Build PyPTO against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces.

When running PyPTO or Simpler examples, TraCR produces a `tracr_0/` directory in `~/ascend/`. Post-process it with this command:

`./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/`

It will generate a `tracr_0/perfetto.json` file, which can be viewed in [Perfetto](https://ui.perfetto.dev/).
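
Beyond visual inspection, bubbles can be quantified from the JSON file. The following is a minimal sketch that assumes the file follows the Chrome Trace Event format accepted by Perfetto, with complete events (`"ph": "X"`) carrying `ts`, `dur`, `pid`, and `tid`. Whether `tracr_process` emits exactly this shape has not been verified here, and the helper names `load_events` and `track_utilization` are hypothetical.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load events from a Chrome-trace-style JSON file (object or bare list)."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def track_utilization(events):
    """Return {(pid, tid): busy fraction} over the global capture window.

    Only complete events (ph == "X") are considered; overlapping events on
    the same track are merged so that busy time is not double-counted.
    """
    spans_by_track = defaultdict(list)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            start = ev["ts"]
            spans_by_track[(ev.get("pid"), ev.get("tid"))].append((start, start + ev["dur"]))
    if not spans_by_track:
        return {}

    t0 = min(s for spans in spans_by_track.values() for s, _ in spans)
    t1 = max(e for spans in spans_by_track.values() for _, e in spans)
    window = t1 - t0

    result = {}
    for track, spans in spans_by_track.items():
        busy = 0
        cur_start, cur_end = None, None
        for s, e in sorted(spans):
            if cur_end is None or s > cur_end:
                if cur_end is not None:
                    busy += cur_end - cur_start
                cur_start, cur_end = s, e
            else:
                cur_end = max(cur_end, e)
        busy += cur_end - cur_start
        result[track] = busy / window if window else 0.0
    return result


# Synthetic events for illustration only; real data would come from
# load_events() applied to the generated perfetto.json file.
demo_events = [
    {"name": "task_a", "ph": "X", "pid": 0, "tid": "aic0", "ts": 0, "dur": 50},
    {"name": "task_b", "ph": "X", "pid": 0, "tid": "aic0", "ts": 40, "dur": 20},
    {"name": "task_c", "ph": "X", "pid": 0, "tid": "aiv0", "ts": 75, "dur": 25},
]
demo_util = track_utilization(demo_events)
for track, frac in sorted(demo_util.items(), key=str):
    print(track, f"busy={frac:.2f}", f"bubble={1 - frac:.2f}")
```

Comparing the per-track bubble fraction of the standalone trace against the in-layer trace would put a number on the difference seen in the timeline view.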

### Motivation / Use Case

We wanted to understand the on-device behavior of paged-attention GQA in pypto3 by comparing two settings:

- **Standalone:** the `pypto/examples/models/04_paged_attention.py` example (attention computed in isolation: a QK → softmax-prepare → SV kernel pipeline), as traced in #986.
- **In-context:** the same paged-attention GQA as it actually runs inside the full qwen3-14B decode layer (`fa_fused` + `online_softmax`).

The goal was to see whether the attention bubbles visible in the standalone example also appear when attention runs as part of a real decode layer.
