[Profiling Report] Quantifying the pypto ↔ CCE paged-attention gap (follow-up to the 6× in simpler/pull: #986) #622

Description

@noabauma

Summary

At the matched qwen3 attention shape, the remaining pypto-vs-CCE gap is ~2.3x
(on-core attention time) — down from the ~6× in #986.

Comparison

Like-for-like at the qwen3-14B attention shape: batch 16, 40 q-heads, 8 KV heads
(5:1 GQA), head_dim 128, context 4096, block size 128, fp16. The two sides are:

  • CCE: spmd_paged_attention_highperf case b16_h40_kv8_s4096_bs128_fp16
    (added to match qwen3; compiles + passes golden).
  • pypto: in-layer fa_fused from the qwen3-14B decode layer.
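
As a quick sanity check on the matched shape, the case name can be decoded and a few derived quantities computed. This is a minimal sketch: the field meanings (b = batch, h = q-heads, kv = KV heads, s = context, bs = block size) are an assumption read off the shape listed above, and `parse_case` / `derived` are hypothetical helpers, not part of simpler or pypto.

```python
import re


def parse_case(name: str) -> dict:
    """Decode a case name such as 'b16_h40_kv8_s4096_bs128_fp16'.

    Assumed convention (inferred from this issue, not from simpler's docs):
    b=batch, h=q-heads, kv=KV heads, s=context length, bs=block size, dtype.
    """
    m = re.fullmatch(r"b(\d+)_h(\d+)_kv(\d+)_s(\d+)_bs(\d+)_(\w+)", name)
    if m is None:
        raise ValueError(f"unrecognized case name: {name}")
    batch, q_heads, kv_heads, seq, block = (int(g) for g in m.groups()[:5])
    return {
        "batch": batch,
        "q_heads": q_heads,
        "kv_heads": kv_heads,
        "seq": seq,
        "block": block,
        "dtype": m.group(6),
    }


def derived(shape: dict, head_dim: int = 128, bytes_per_elem: int = 2) -> dict:
    """Derived quantities; head_dim 128 and 2 bytes (fp16) come from the issue."""
    return {
        "gqa_ratio": shape["q_heads"] // shape["kv_heads"],
        "blocks_per_seq": shape["seq"] // shape["block"],
        # K and V caches together, hence the leading factor of 2.
        "kv_bytes": 2 * shape["batch"] * shape["kv_heads"] * shape["seq"]
        * head_dim * bytes_per_elem,
    }


if __name__ == "__main__":
    shape = parse_case("b16_h40_kv8_s4096_bs128_fp16")
    print(shape)
    print(derived(shape))
```

For this case the sketch gives a 5:1 GQA ratio, 32 KV blocks per sequence, and 256 MiB of K+V cache read per attention pass.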

Since fa_fused runs inside the full decode layer (overlapped with the other
kernels), the fair metric is the attention kernels' on-core (AICube/AIVector)
busy time from the TraCR traces, not wall-clock time.
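
One plausible way to compute such a busy time from a trace is sketched below. It assumes the trace follows the Chrome Trace Event JSON format (complete events with `ph == "X"`, `ts` and `dur` in microseconds) and that attention kernels can be picked out by a substring of the event name; neither assumption is confirmed by TraCR's documentation here. Whether the numbers in this issue were summed over cores, averaged, or read from a single core is not stated, so the sketch reports busy time per track and leaves the aggregation to the reader.

```python
from collections import defaultdict


def busy_time_per_track(events, name_filter):
    """Return {(pid, tid): busy_us} for complete events whose name matches.

    Overlapping intervals on the same track are merged (interval union),
    so nested or overlapping slices are not double counted.
    """
    tracks = defaultdict(list)
    for ev in events:
        if ev.get("ph") != "X" or name_filter not in ev.get("name", ""):
            continue
        start = ev["ts"]
        tracks[(ev.get("pid"), ev.get("tid"))].append((start, start + ev.get("dur", 0)))

    busy = {}
    for track, intervals in tracks.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start > cur_end:
                # Gap between slices: close the current merged interval.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        total += cur_end - cur_start
        busy[track] = total
    return busy


if __name__ == "__main__":
    # Synthetic events for illustration only; not taken from the real traces.
    demo = [
        {"ph": "X", "name": "fa_fused", "pid": 0, "tid": 1, "ts": 0, "dur": 10},
        {"ph": "X", "name": "fa_fused", "pid": 0, "tid": 1, "ts": 5, "dur": 10},
        {"ph": "X", "name": "other", "pid": 0, "tid": 1, "ts": 0, "dur": 100},
    ]
    print(busy_time_per_track(demo, "fa_fused"))
```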

metric                   CCE b16_h40_kv8_s4096_bs128_fp16   pypto fa_fused + --max-seq   gap
on-core attention time   ~548us                             ~1263us                      ~2.3x
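
The gap figure follows directly from the two measured times. The short sketch below reproduces the arithmetic and also expresses the gap as a percentage; the helper names are illustrative only.

```python
def gap_ratio(pypto_us: float, cce_us: float) -> float:
    """How many times longer the pypto kernels are busy than the CCE kernel."""
    return pypto_us / cce_us


def percent_slower(pypto_us: float, cce_us: float) -> float:
    """The same gap expressed as extra on-core time relative to CCE."""
    return (gap_ratio(pypto_us, cce_us) - 1.0) * 100.0


if __name__ == "__main__":
    cce_us, pypto_us = 548.0, 1263.0  # approximate values from the table above
    print(f"gap: {gap_ratio(pypto_us, cce_us):.2f}x")
    print(f"pypto is {percent_slower(pypto_us, cce_us):.0f}% slower")
```

With the rounded inputs this gives about 2.30x, i.e. roughly 130% more on-core time than CCE.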

Profile view

[Image] LEFT: CCE b16_h40_kv8_s4096_bs128_fp16; RIGHT: pypto fa_fused + --max-seq

These are the trace files to visualize in Perfetto:

cce_b16_h40_kv8_s4096_bs128_fp16.json

qwen3_14b_decode.json

How to recreate the plot

To create these profiles, use the Simpler with TraCR branch and set the environment flag BUILD_TRACR=ON when compiling Simpler:

BUILD_TRACR=ON pip install --no-build-isolation -e .

TraCR is a low-level profiler that captures traces on the Ascend device. Also set _DEFAULT_RUNTIME = "tensormap_and_ringbuffer" inside pypto/runtime/worker.py, because TraCR is currently built only on top of this Simpler runtime scheduler. PyPTO must also be built against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces.

When running PyPTO or Simpler examples, TraCR produces a tracr_0/ directory in ~/ascend/. Post-process it by running this command:

./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/

It will generate a tracr_0/perfetto.json file, which can be viewed in Perfetto.
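
Before opening the file in Perfetto, it can help to check that the post-processed trace actually contains events. The sketch below assumes perfetto.json follows the Chrome Trace Event JSON format, which allows either a top-level array of events or an object with a `traceEvents` key; that the TraCR output uses this layout is an assumption, and `extract_events` / `summarize` are hypothetical helpers.

```python
import json
import sys
from collections import Counter
from pathlib import Path


def extract_events(data):
    """Return the event list from parsed trace JSON (array or object form)."""
    if isinstance(data, dict):
        return data.get("traceEvents", [])
    return list(data)


def summarize(events):
    """Count complete ('X') events per name to see which kernels were captured."""
    return Counter(ev.get("name", "<unnamed>") for ev in events if ev.get("ph") == "X")


if __name__ == "__main__":
    # Usage: python check_trace.py ~/ascend/tracr_0/perfetto.json
    path = Path(sys.argv[1]).expanduser()
    events = extract_events(json.loads(path.read_text()))
    print(f"{len(events)} events in {path}")
    for name, count in summarize(events).most_common(10):
        print(f"{count:8d}  {name}")
```

An empty summary would suggest the trace was captured without TraCR enabled, for example because PyPTO was built against a different Simpler version.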

Motivation / Use Case

#986 showed the standalone pypto paged-attention example running ~6× slower than
the optimized CCE kernel. With the optimized attention from pypto-lib#607, the gap
should be much smaller; the question is how much: 20%? 50%? This is the setup
and baseline for answering that quantitatively against the optimized CCE
implementation in simpler#899 (spmd_paged_attention_highperf), profiled on
a2a3 (910B) with the simpler/TraCR runtime profiler.
