What was profiled
- Workload: qwen3-14B single decode layer (golden unit test), batch=16, varied sequence lengths, fp16/bf16, on a2a3 (910B), passed validation.
- Tool: simpler tracr branch runtime profiler (TraCR), capturing the AICPU scheduler + AICore execution timeline directly on device.
- Capture: 6,515 trace events across 77 channels (4 AICPU scheduler threads, 24 AICube cores, 48 AIVector cores), with scheduler markers (Orchestrating, Scheduling, Phase1–4, Drain) and per-task execution (Running_Task_Single/Pair); kernels identified by task id.
Key finding: the in-layer attention timeline has far fewer bubbles than the standalone example
Both settings compute the same paged-attention GQA, and the attention kernel itself has the same inherent bubbles in both. The difference is what else is on the timeline to hide them:
- The decode layer dispatches ~30 kernels (q/k/v_proj, qk_norm, rope_qkv, fa_fused, online_softmax, out_proj, gate/up/down_proj, silu, …). Attention accounts for only 2 of them. The compute-bound projection/MLP matmuls keep the 24 AICube cores saturated, so whenever attention stalls, the scheduler has other independent work running on those cores.
- Standalone paged attention has nothing to overlap with, so its inherent stalls are fully exposed as bubbles:
  - memory-bound: it streams the KV cache with low arithmetic intensity, so cores wait on HBM;
  - a serial QK → softmax → SV chain that ping-pongs between AIC (cube matmuls) and AIV (vector softmax), each idling while the other runs;
  - low decode parallelism: only batch×kv_heads small work items, not enough to fill 24 cube + 48 vector cores.
- fa_fused additionally pipelines internally (pl.pipeline(stage=2), overlapping iteration i+1's QK with iteration i's softmax), shrinking intra-attention gaps; whatever remains is absorbed by the surrounding layer.
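The effect of the two-stage overlap can be illustrated with a toy timing model. This is a minimal sketch, not the fa_fused implementation: it assumes QK and softmax run on separate units (AIC and AIV) with fixed per-iteration durations, it ignores the SV stage, and the function names and durations are hypothetical rather than measured.

```python
def serial_time(n_iters, t_qk, t_softmax):
    """Total time when each iteration runs QK, then softmax, with no overlap."""
    return n_iters * (t_qk + t_softmax)


def pipelined_time(n_iters, t_qk, t_softmax):
    """Total time for a two-stage software pipeline.

    QK of iteration i+1 overlaps softmax of iteration i, so after the
    first QK (fill) each step costs the slower of the two stages, and the
    last softmax (drain) is paid once at the end.
    """
    if n_iters == 0:
        return 0
    return t_qk + (n_iters - 1) * max(t_qk, t_softmax) + t_softmax


if __name__ == "__main__":
    # Hypothetical durations in arbitrary time units.
    n, t_qk, t_sm = 4, 3, 2
    print("serial:   ", serial_time(n, t_qk, t_sm))     # 20
    print("pipelined:", pipelined_time(n, t_qk, t_sm))  # 14
```

In this model the pipelined total is bounded by the slower stage, so some idle time remains on the faster unit; that residual is the part the surrounding layer has to absorb.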
Takeaway
The bubbles are latency-hidden, not eliminated. A dense in-layer timeline is the desired outcome — it means the scheduler is successfully overlapping attention with the rest of the decode layer, which is what matters for end-to-end throughput. However:
- For optimizing paged attention itself, the standalone example is the better diagnostic — it exposes the real attention bottleneck (memory-bound + AIC↔AIV serialization) that the full-layer view masks.
- Any isolated attention latency target (e.g. the fa_fused + softmax budget) should be measured standalone; inside the full layer, part of attention's cost is hidden behind MLP/projection work, so an in-layer reading understates it.
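One way to reason about "hidden" versus "exposed" stall time is to intersect the intervals in which attention stalls with the intervals in which other kernels are running. The sketch below does this on plain (start, end) tuples; the interval values and function names are hypothetical and are not taken from the trace.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_length(a, b):
    """Total length of the intersection of two interval sets."""
    a, b = merge(a), merge(b)
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def hidden_fraction(stalls, other_work):
    """Fraction of stall time covered by other running work."""
    stall_total = sum(e - s for s, e in merge(stalls))
    if stall_total == 0:
        return 0.0
    return overlap_length(stalls, other_work) / stall_total


if __name__ == "__main__":
    stalls = [(10, 14), (20, 26)]    # hypothetical attention stalls
    in_layer = [(0, 12), (13, 22)]   # hypothetical projection/MLP work
    print(hidden_fraction(stalls, in_layer))  # in-layer: partly hidden
    print(hidden_fraction(stalls, []))        # standalone: nothing hides them
```

With no other work on the timeline the hidden fraction is zero, which mirrors why the standalone run shows the stalls as bubbles.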
Profile view
Trace file to open in Perfetto:
tracr_qwen3_14b_decode.json
How to recreate the plot
To create these profiles, use the TraCR branch of Simpler and set the environment flag BUILD_TRACR=ON when building it:
BUILD_TRACR=ON pip install --no-build-isolation -e .
TraCR is a low-level profiler that captures traces on the Ascend device. Also set _DEFAULT_RUNTIME = "tensormap_and_ringbuffer" in pypto/runtime/worker.py, because TraCR is currently built only on top of this Simpler runtime scheduler. PyPTO must also be built against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces. Running a PyPTO or Simpler example then produces a tracr_0/ directory in ~/ascend/, which has to be post-processed with this command:
./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/
It will generate a tracr_0/perfetto.json file, which can be viewed in Perfetto.
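Besides viewing the trace in Perfetto, per-track utilization can be estimated offline. The following is a minimal sketch that assumes perfetto.json follows the Chrome Trace Event JSON format (either a top-level list or an object with a "traceEvents" list) and that task executions are recorded as complete events (ph == "X" with ts and dur). The actual TraCR output layout is not confirmed here, so the field handling and function names are assumptions.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events; assumes Chrome Trace Event JSON (an assumption)."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def track_utilization(events):
    """Return {(pid, tid): busy fraction} over the whole trace span.

    Only complete events (ph == "X") are counted; overlapping events on
    the same track are merged so nested slices are not double counted.
    """
    per_track = defaultdict(list)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            key = (ev.get("pid"), ev.get("tid"))
            per_track[key].append((ev["ts"], ev["ts"] + ev["dur"]))
    if not per_track:
        return {}

    t0 = min(s for ivs in per_track.values() for s, _ in ivs)
    t1 = max(e for ivs in per_track.values() for _, e in ivs)
    span = t1 - t0

    result = {}
    for track, ivs in per_track.items():
        busy = 0
        cur_s = cur_e = None
        for s, e in sorted(ivs):
            if cur_e is None or s > cur_e:
                if cur_e is not None:
                    busy += cur_e - cur_s
                cur_s, cur_e = s, e
            else:
                cur_e = max(cur_e, e)
        busy += cur_e - cur_s
        result[track] = busy / span if span else 0.0
    return result


if __name__ == "__main__":
    import os

    # Path taken from the post-processing step; adjust as needed.
    path = os.path.expanduser("~/ascend/tracr_0/perfetto.json")
    if os.path.exists(path):
        for track, util in sorted(track_utilization(load_events(path)).items(), key=str):
            print(track, f"{util:.1%}")
```

A track with low utilization has large gaps between tasks, which is what appears as bubbles in the Perfetto timeline.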
Motivation / Use Case
We wanted to understand the on-device behavior of paged-attention GQA in pypto3 by comparing two settings:
- Standalone: the pypto/examples/models/04_paged_attention.py example (attention computed in isolation: a QK → softmax-prepare → SV kernel pipeline), as traced in #986.
- In-context: the same paged-attention GQA as it actually runs inside the full qwen3-14B decode layer (fa_fused + online_softmax).
The goal was to see whether the attention bubbles visible in the standalone example also appear when attention runs as part of a real decode layer.