Summary
At the matched qwen3 attention shape, the remaining pypto-vs-CCE gap is ~2.3x
(on-core attention time) — down from the ~6× in #986.
Comparison
Like-for-like comparison at the qwen3-14B attention shape (batch 16, 40 q-heads,
8 KV heads, 5:1 GQA, head_dim 128, context 4096, block size 128, fp16):
- CCE: spmd_paged_attention_highperf case b16_h40_kv8_s4096_bs128_fp16
  (added to match qwen3; compiles + passes golden).
- pypto: in-layer fa_fused from the qwen3-14B decode layer.
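As a quick sanity check on the shape, the sketch below derives the GQA ratio, the number of KV blocks per sequence, and the per-layer KV footprint from the numbers listed above. The `AttentionShape` class and `case_name` helper are hypothetical names introduced here for illustration, and the field order in the case name is inferred from the string b16_h40_kv8_s4096_bs128_fp16, not taken from the simpler sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical helper describing one decode-attention shape."""
    batch: int
    q_heads: int
    kv_heads: int
    head_dim: int
    context: int
    block_size: int
    dtype_bytes: int  # fp16 -> 2 bytes per element

    @property
    def gqa_ratio(self) -> int:
        # Number of query heads sharing one KV head.
        return self.q_heads // self.kv_heads

    @property
    def blocks_per_sequence(self) -> int:
        # Ceiling division: KV-cache blocks needed to hold one full context.
        return -(-self.context // self.block_size)

    @property
    def kv_bytes_per_sequence(self) -> int:
        # K and V, for every KV head and token, in one layer.
        return 2 * self.kv_heads * self.context * self.head_dim * self.dtype_bytes


def case_name(s: AttentionShape) -> str:
    # Assumed encoding of the shape into the CCE case name.
    return f"b{s.batch}_h{s.q_heads}_kv{s.kv_heads}_s{s.context}_bs{s.block_size}_fp16"


QWEN3_14B = AttentionShape(batch=16, q_heads=40, kv_heads=8, head_dim=128,
                           context=4096, block_size=128, dtype_bytes=2)

print(case_name(QWEN3_14B))
print("GQA ratio:", QWEN3_14B.gqa_ratio)
print("KV blocks per sequence:", QWEN3_14B.blocks_per_sequence)
print("KV MiB per sequence (one layer):", QWEN3_14B.kv_bytes_per_sequence / 2**20)
```

At this shape each sequence reads 32 blocks and 16 MiB of K/V per layer, so the full batch of 16 touches 256 MiB per attention call.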
Since fa_fused runs inside the full decode layer (overlapped with the other
kernels), the fair metric is the attention kernels' on-core (AICube/AIVector)
busy time from the TraCR traces, not wall-clock.
| metric | CCE b16_h40_kv8_s4096_bs128_fp16 | pypto fa_fused + --max-seq | gap |
| --- | --- | --- | --- |
| on-core attention time | ~548 us | ~1263 us | ~2.3x |
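The gap figure follows directly from the two measured times. The small sketch below reproduces it and also expresses it as extra time relative to CCE, which is the form the "20%? 50%?" question in the motivation uses. The `slowdown` function is a hypothetical helper; the inputs are the approximate values from the table.

```python
def slowdown(candidate_us: float, baseline_us: float) -> float:
    """Ratio of candidate time to baseline time (>1 means candidate is slower)."""
    if baseline_us <= 0:
        raise ValueError("baseline time must be positive")
    return candidate_us / baseline_us


CCE_US = 548.0     # approximate on-core attention time, CCE
PYPTO_US = 1263.0  # approximate on-core attention time, pypto fa_fused

gap = slowdown(PYPTO_US, CCE_US)
extra_pct = (gap - 1.0) * 100.0

print(f"gap: {gap:.2f}x")                    # gap: 2.30x
print(f"extra time vs CCE: {extra_pct:.0f}%")  # extra time vs CCE: 130%
```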
Profile view
LEFT: CCE b16_h40_kv8_s4096_bs128_fp16 RIGHT: pypto fa_fused + --max-seq

These are the trace files to load in Perfetto:
cce_b16_h40_kv8_s4096_bs128_fp16.json
qwen3_14b_decode.json
How to recreate the plot
To create these profiles, use the Simpler with TraCR branch and build Simpler with the BUILD_TRACR=ON environment flag:
BUILD_TRACR=ON pip install --no-build-isolation -e .
TraCR is a low-level profiler that captures traces on the Ascend device. Also set _DEFAULT_RUNTIME = "tensormap_and_ringbuffer" in pypto/runtime/worker.py, because TraCR is currently built only on top of this Simpler runtime scheduler. PyPTO must also be built against this Simpler version; otherwise PyPTO-based examples will not capture TraCR traces. Running a PyPTO or Simpler example then produces a tracr_0/ directory in ~/ascend/, which has to be post-processed with this command:
./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/
It will generate a tracr_0/perfetto.json file, which can be viewed in Perfetto.
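Besides inspecting the trace visually, the on-core busy time can be summed from the JSON. The sketch below is a minimal version under stated assumptions: it assumes perfetto.json follows the Chrome Trace Event Format (complete events with "ph": "X", with "ts" and "dur" in microseconds), that each core appears as its own (pid, tid) track, and that attention kernels can be selected by a name substring such as "fa_fused". None of this is confirmed against TraCR's actual output, and how the per-core values are combined into the single figure reported above is not specified here.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events from a Chrome-trace-style JSON file."""
    with open(path) as f:
        data = json.load(f)
    # The format allows either {"traceEvents": [...]} or a bare list.
    return data["traceEvents"] if isinstance(data, dict) else data


def busy_time_us(events, name_filter):
    """Per-track busy time in us for events whose name contains name_filter.

    Overlapping intervals on the same track are merged so that time is
    not counted twice.
    """
    intervals = defaultdict(list)
    for ev in events:
        if ev.get("ph") != "X":  # only complete (duration) events
            continue
        if name_filter not in ev.get("name", ""):
            continue
        start = float(ev["ts"])
        end = start + float(ev.get("dur", 0))
        intervals[(ev.get("pid"), ev.get("tid"))].append((start, end))

    busy = {}
    for track, spans in intervals.items():
        spans.sort()
        total = 0.0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start > cur_end:  # gap: close the current merged interval
                total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:                # overlap: extend the current interval
                cur_end = max(cur_end, end)
        total += cur_end - cur_start
        busy[track] = total
    return busy


# Synthetic events for illustration only (not real TraCR output).
demo_events = [
    {"ph": "X", "name": "fa_fused", "ts": 0, "dur": 10, "pid": 1, "tid": 1},
    {"ph": "X", "name": "fa_fused", "ts": 5, "dur": 10, "pid": 1, "tid": 1},
    {"ph": "X", "name": "other_kernel", "ts": 0, "dur": 100, "pid": 1, "tid": 1},
    {"ph": "X", "name": "fa_fused", "ts": 20, "dur": 5, "pid": 1, "tid": 2},
]
demo_busy = busy_time_us(demo_events, "fa_fused")
print(demo_busy)
```

For a real trace, `busy_time_us(load_events(path), "fa_fused")` would give one value per track; the filter string has to match whatever kernel names the trace actually contains.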
Motivation / Use Case
#986 showed the standalone pypto paged-attention example running ~6× slower than
the optimized CCE kernel. With the optimized attention from pypto-lib#607, the
gap should be much smaller; the question is how much: 20%? 50%? This is the setup
and baseline for answering that quantitatively against the optimized CCE
implementation in simpler#899 (spmd_paged_attention_highperf), profiled on a2a3
(910B) with the simpler/TraCR runtime profiler.