[Profiling Report] Quantifying the pypto ↔ CCE paged-attention gap (follow-up to the 6× in simpler/pull: #986)

### Summary

At the matched qwen3 attention shape, the remaining pypto-vs-CCE gap in on-core
attention time is ~2.3×, down from the ~6× reported in [#986](https://github.com/hw-native-sys/simpler/pull/986).

### Comparison

Like-for-like comparison at the qwen3-14B attention shape (**batch 16, 40 q-heads,
8 KV heads (5:1 GQA), head_dim 128, context 4096, block size 128, fp16**):

- **CCE**: `spmd_paged_attention_highperf` case `b16_h40_kv8_s4096_bs128_fp16`
  (added to match qwen3; compiles + passes golden).
- **pypto**: in-layer `fa_fused` from the qwen3-14B decode layer.
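
For reference, the quantities implied by this shape can be derived with a few lines of arithmetic. This is a minimal sketch, not taken from either kernel: the variable names are illustrative, and the KV-cache size assumes every sequence in the batch sits at the full 4096-token context and counts a single layer.

```python
# Shape from the comparison above (b16_h40_kv8_s4096_bs128_fp16).
batch, q_heads, kv_heads = 16, 40, 8
head_dim, context, block = 128, 4096, 128
bytes_per_elem = 2  # fp16

# Query heads sharing each KV head (the 5:1 GQA ratio).
gqa_ratio = q_heads // kv_heads

# Paged-attention blocks per sequence and across the whole batch.
blocks_per_seq = context // block
total_blocks = batch * blocks_per_seq

# K and V cache bytes for one layer, assuming full context for every sequence.
kv_bytes = 2 * batch * kv_heads * context * head_dim * bytes_per_elem

print(f"GQA ratio: {gqa_ratio}:1")
print(f"blocks per sequence: {blocks_per_seq}, total: {total_blocks}")
print(f"KV cache per layer: {kv_bytes / 2**20:.0f} MiB")
```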

Since `fa_fused` runs inside the full decode layer (overlapped with the other
kernels), the fair metric is the attention kernels' **on-core (AICube/AIVector)
busy time** from the TraCR traces, not wall-clock.

| metric | CCE `b16_h40_kv8_s4096_bs128_fp16` | pypto `fa_fused` + `--max-seq` | gap |
|---|---|---|---|
| on-core attention time | ~548us | ~1263us | ~2.3x |
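
The gap in the table is the ratio of the two measured times. The short sketch below recomputes it from the rounded values above and also expresses it as a percentage, which is the form the motivating question ("20%? 50%?") uses. The inputs are the approximate table values, so the result carries the same rounding.

```python
# Approximate on-core attention times from the table, in microseconds.
cce_us = 548.0
pypto_us = 1263.0

gap = pypto_us / cce_us           # how many times slower pypto is
extra_pct = (gap - 1.0) * 100.0   # extra time relative to CCE, in percent

print(f"gap: {gap:.2f}x, pypto takes {extra_pct:.0f}% longer")
# prints "gap: 2.30x, pypto takes 130% longer"
```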

### Profile view

LEFT: CCE `b16_h40_kv8_s4096_bs128_fp16`      RIGHT: pypto `fa_fused` + `--max-seq`
<img width="1916" height="1042" alt="Image" src="https://github.com/user-attachments/assets/03a2e82c-a02f-4fca-8c71-0cd44190982f" />

These are the trace files to open in [Perfetto](https://ui.perfetto.dev/):

[cce_b16_h40_kv8_s4096_bs128_fp16.json](https://github.com/user-attachments/files/29375167/cce_b16_h40_kv8_s4096_bs128_fp16.json)

[qwen3_14b_decode.json](https://github.com/user-attachments/files/29375169/qwen3_14b_decode.json)

### How to recreate the plot

To recreate these profiles:

1. Check out the [Simpler with TraCR](https://github.com/huawei-csl/simpler/tree/tracr) branch. [TraCR](https://github.com/huawei-csl/TracR) is a low-level profiler that captures traces on the Ascend device.
2. Build Simpler with the `BUILD_TRACR=ON` environment variable set: `BUILD_TRACR=ON pip install --no-build-isolation -e .`
3. Set `_DEFAULT_RUNTIME = "tensormap_and_ringbuffer"` in `pypto/runtime/worker.py`, because TraCR is currently built only on top of this Simpler runtime scheduler.
4. Build PyPTO against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces.
5. Run the PyPTO or Simpler example. TraCR writes a `tracr_0/` directory under `~/ascend/`.
6. Post-process that directory with the following command:

`./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/`

It will generate a `tracr_0/perfetto.json` file, which can be viewed in [Perfetto](https://ui.perfetto.dev/).
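
Once `perfetto.json` exists, the on-core busy time can also be totalled outside the Perfetto UI. The sketch below is an assumption-laden starting point rather than the method used for the numbers in this report: it assumes the file follows the Chrome Trace Event JSON format (complete events with `"ph": "X"` and a `dur` field in microseconds), and that attention kernels can be picked out by a substring of the event name. The helper names and the `"fa_fused"` filter are hypothetical; check the actual event names in your trace first.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events from a JSON file (array or {"traceEvents": [...]})."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def busy_time_us(events, name_substring):
    """Sum durations of complete ("X") events matching a name, per (pid, tid) track."""
    per_track = defaultdict(float)
    for ev in events:
        if ev.get("ph") != "X":
            continue  # skip metadata, counters, and begin/end pairs
        if name_substring not in ev.get("name", ""):
            continue
        per_track[(ev.get("pid"), ev.get("tid"))] += ev.get("dur", 0)
    return dict(per_track)


# Example usage (path and filter string are assumptions):
# events = load_events("tracr_0/perfetto.json")
# for track, total in sorted(busy_time_us(events, "fa_fused").items()):
#     print(track, f"{total:.1f} us")
```

The function reports one total per track and leaves aggregation across cores (maximum, mean, or sum) to the caller, since that choice affects the headline number.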

### Motivation / Use Case

[#986](https://github.com/hw-native-sys/simpler/pull/986#issuecomment-4742742424)
showed the standalone pypto paged-attention example running ~**6×** slower than
the optimized CCE kernel. With the optimized attention from
[pypto-lib#607](https://github.com/hw-native-sys/pypto-lib/issues/607), the gap
should be much smaller — the question is *how much*: 20%? 50%? This is the setup
and baseline for answering that quantitatively against the optimized CCE
implementation in [simpler#899](https://github.com/hw-native-sys/simpler/pull/899)
(`spmd_paged_attention_highperf`), profiled on a2a3 (910B) with the simpler/TraCR
runtime profiler.
