
[Profiling Report] Profiling qwen3-14B decode attention #607

Description

@noabauma

What was profiled

  • Workload: qwen3-14B single decode layer (golden unit test), batch=16, varied sequence lengths, fp16/bf16, on a2a3 (910B), passed validation.
  • Tool: the runtime profiler (TraCR) from the tracr branch of Simpler, capturing the AICPU scheduler + AICore execution timeline directly on device.
  • Capture: 6,515 trace events across 77 channels — 4 AICPU scheduler threads, 24 AICube cores, 48 AIVector cores — with scheduler markers (Orchestrating, Scheduling, Phase1–4, Drain) and per-task execution (Running_Task_Single/Pair); kernels identified by task id.

Key finding: the in-layer attention timeline has far fewer bubbles than the standalone example

Both settings compute the same paged-attention GQA, and the attention kernel itself has the same inherent bubbles in both. The difference is what else is on the timeline to hide them:

  • The decode layer dispatches ~30 kernels (q/k/v_proj, qk_norm, rope_qkv, fa_fused, online_softmax, out_proj, gate/up/down_proj, silu, …). Attention is only 2 of them. The compute-bound projection/MLP matmuls keep the 24 AICube cores saturated, so whenever attention stalls, the scheduler has other independent work running on those cores.

  • Standalone paged attention has nothing to overlap with, so its inherent stalls are fully exposed as bubbles:

    • memory-bound — it streams the KV cache with low arithmetic intensity, so cores wait on HBM;
    • serial QK → softmax → SV chain that ping-pongs between AIC (cube matmuls) and AIV (vector softmax), each idling while the other runs;
    • low decode parallelism — only batch×kv_heads small work items, not enough to fill 24 cube + 48 vector cores.
  • fa_fused additionally pipelines internally (pl.pipeline(stage=2), overlapping iteration i+1's QK with iteration i's softmax), shrinking intra-attention gaps; whatever remains is absorbed by the surrounding layer.
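The "memory-bound" point above can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplified model, not something measured in this report: it counts only the QK and SV matmul FLOPs for one KV head and only the bytes needed to stream K and V once, ignoring Q/output traffic and softmax work. The function name and the example values (group size 5, head_dim 128, 2-byte elements) are illustrative assumptions.

```python
def decode_attention_intensity(group_size, seq_len, head_dim, bytes_per_elem=2):
    """Rough FLOPs-per-byte for one KV head during decode (one new token).

    group_size: number of query heads sharing this KV head (GQA group).
    Simplified model: matmul FLOPs only, KV-cache traffic only.
    """
    # QK^T: group_size x head_dim times head_dim x seq_len -> 2*G*S*D FLOPs.
    # SV:   group_size x seq_len times seq_len x head_dim  -> 2*G*S*D FLOPs.
    flops = 4 * group_size * seq_len * head_dim
    # K and V are each streamed once from the cache.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved


# Hypothetical example values; the intensity does not depend on seq_len or
# head_dim in this model, only on the group size and the element width.
intensity = decode_attention_intensity(group_size=5, seq_len=4096, head_dim=128)
print(f"~{intensity:.1f} FLOPs per byte of KV cache read")
```

In this model the intensity reduces to 2 × group_size / bytes_per_elem, so it stays at a few FLOPs per byte regardless of sequence length. That is consistent with the cores waiting on HBM rather than on compute.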

Takeaway

The bubbles are latency-hidden, not eliminated. A dense in-layer timeline is the desired outcome — it means the scheduler is successfully overlapping attention with the rest of the decode layer, which is what matters for end-to-end throughput. However:

  • For optimizing paged attention itself, the standalone example is the better diagnostic — it exposes the real attention bottleneck (memory-bound + AIC↔AIV serialization) that the full-layer view masks.
  • Any isolated attention latency target (e.g. the fa_fused + softmax budget) should be measured standalone; inside the full layer, part of attention's cost is hidden behind MLP/projection work, so an in-layer reading understates it.
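To illustrate why an in-layer reading understates attention cost, here is a minimal sketch that splits attention's busy time into a part overlapped by other kernels and a part that is exposed on the timeline. It assumes the attention and non-attention task intervals have already been extracted from the trace as (start, end) pairs; the helper names and the example numbers are hypothetical.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted disjoint list."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def covered(intervals):
    """Total time covered by at least one interval."""
    return sum(e - s for s, e in merge(intervals))


def overlap(a, b):
    """Total time covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def exposed_time(attention, other):
    """Attention time during which no other kernel is running."""
    return covered(attention) - overlap(attention, other)


# Hypothetical timeline (arbitrary time units).
attention = [(0, 10), (20, 30)]
other = [(5, 25)]
print(covered(attention), exposed_time(attention, other))  # 20 10
```

In this toy timeline attention is busy for 20 units, but only 10 of them are visible as attention-only time. A standalone run has no `other` work, so it reports the full cost.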

Profile view

Image

This is the file to visualize on Perfetto:
tracr_qwen3_14b_decode.json

How to recreate the plot

To create these profiles, use the tracr branch of Simpler and set the environment flag BUILD_TRACR=ON when compiling it: BUILD_TRACR=ON pip install --no-build-isolation -e . (TraCR is a low-level profiler that captures traces on the Ascend device.) Also set _DEFAULT_RUNTIME = "tensormap_and_ringbuffer" in pypto/runtime/worker.py, because TraCR is currently built only on top of this Simpler runtime scheduler. PyPTO must also be built against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces. Running a PyPTO or Simpler example then produces a tracr_0/ directory in ~/ascend/, which has to be post-processed with this command:

./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/

It will generate a tracr_0/perfetto.json file, which can be viewed in Perfetto.
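Besides viewing the file in Perfetto, the bubbles can be quantified with a short script. The sketch below assumes perfetto.json follows the Chrome Trace Event JSON layout (either a top-level list or an object with a "traceEvents" list) and that task executions are recorded as complete events (ph == "X") with ts and dur fields. The actual TraCR output has not been checked against these assumptions, and the function name is hypothetical.

```python
import json
from collections import defaultdict


def busy_fraction_per_track(trace):
    """Return {(pid, tid): busy_time / observed_span} for complete events.

    Assumes Chrome Trace Event style events with "ph", "ts", "dur".
    Overlapping or nested events on one track are merged before summing.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    intervals = defaultdict(list)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            key = (ev.get("pid"), ev.get("tid"))
            intervals[key].append((ev["ts"], ev["ts"] + ev["dur"]))
    if not intervals:
        return {}

    # Observed span: first start to last end across all tracks.
    start = min(s for iv in intervals.values() for s, _ in iv)
    end = max(e for iv in intervals.values() for _, e in iv)
    span = end - start

    result = {}
    for track, iv in intervals.items():
        iv.sort()
        busy = 0
        cur_s, cur_e = iv[0]
        for s, e in iv[1:]:
            if s > cur_e:  # gap (bubble) on this track
                busy += cur_e - cur_s
                cur_s, cur_e = s, e
            else:
                cur_e = max(cur_e, e)
        busy += cur_e - cur_s
        result[track] = busy / span if span else 0.0
    return result


def load_trace(path):
    """Load a trace file, e.g. the post-processed tracr_0/perfetto.json."""
    with open(path) as f:
        return json.load(f)
```

A track with a busy fraction close to 1.0 corresponds to a dense row in the Perfetto view; comparing the fractions from the standalone and in-layer traces gives a numeric version of the visual comparison.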

Motivation / Use Case

We wanted to understand the on-device behavior of paged-attention GQA in pypto3 by comparing two settings:

  • Standalone: the pypto/examples/models/04_paged_attention.py example (attention computed in isolation: a QK → softmax-prepare → SV kernel pipeline), as traced in #986.
  • In-context: the same paged-attention GQA as it actually runs inside the full qwen3-14B decode layer (fa_fused + online_softmax).

The goal was to see whether the attention bubbles visible in the standalone example also appear when attention runs as part of a real decode layer.
