Summary
Captured AICPU/host-runtime-level traces for the standalone decode_attention_csa.py (DeepSeek V4 CSA decode-attention) kernel using the TraCR-enabled simpler runtime.
Setup
Kernel ran standalone on a single Ascend 910B2 NPU via the pypto-lib JIT harness (run_jit).
Simpler built from the TraCR branch with BUILD_TRACR=ON.
Result
Kernel passed golden validation against the torch reference — kv_cache and x_out both PASS, so the traces reflect a correct, complete run.
TraCR recorded the simpler task-graph scheduler across the 4 AICPU threads.
Profile view
Open this file in Perfetto to view the trace:
decode_attention_csa.json
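Before opening the trace in Perfetto, a quick script can confirm that the file parses and how many threads it contains, for example to check that the 4 AICPU threads show up. The following is a minimal sketch, assuming the file uses the Chrome Trace Event JSON layout (either a bare event array or an object with a "traceEvents" key) and that tasks are recorded as complete ("X") events. The exact layout TraCR emits is not stated here, and the helper names load_events and summarize_by_thread are hypothetical.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events from a JSON file.

    Assumption: Chrome Trace Event format, i.e. either a list of events
    or an object holding the list under "traceEvents".
    """
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_by_thread(events):
    """Return {(pid, tid): (event_count, total_duration)} for 'X' events."""
    summary = defaultdict(lambda: [0, 0.0])
    for ev in events:
        # Skip metadata, counters, and anything that is not a complete event.
        if ev.get("ph") != "X":
            continue
        key = (ev.get("pid"), ev.get("tid"))
        summary[key][0] += 1
        summary[key][1] += ev.get("dur", 0)
    return {key: tuple(value) for key, value in summary.items()}


# Example use (the path is the file named above):
#   stats = summarize_by_thread(load_events("decode_attention_csa.json"))
#   for (pid, tid), (count, dur) in stats.items():
#       print(pid, tid, count, dur)
```

If the summary comes back empty, the trace may use begin/end ("B"/"E") event pairs instead of "X" events, in which case the filter needs adjusting.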
How to recreate the plot
To recreate these profiles, use the Simpler TraCR branch and set the environment variable BUILD_TRACR=ON when compiling Simpler:
BUILD_TRACR=ON pip install --no-build-isolation -e .
TraCR is a low-level profiler that captures traces on the Ascend device. Also set _DEFAULT_RUNTIME = "tensormap_and_ringbuffer" in pypto/runtime/worker.py, because TraCR is currently built only on top of this Simpler runtime scheduler. PyPTO must also be built against this Simpler version; otherwise, PyPTO-based examples will not capture TraCR traces. When a PyPTO or Simpler example runs, TraCR writes a tracr_0/ directory under ~/ascend/. Post-process it with this command:
./pypto/runtime/build/output/bin/tracr_process ~/ascend/tracr_0/
It will generate a tracr_0/perfetto.json file, which can be viewed in Perfetto.
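The post-processing step can be wrapped in a small script that fails early when the trace directory is missing. This is a sketch under stated assumptions: the default tool path is the one quoted above, relative to the current working directory. It also assumes perfetto.json lands inside the trace directory and that tracr_process accepts the directory without a trailing slash. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Tool path as quoted in the steps above, relative to the working directory.
DEFAULT_TOOL = "./pypto/runtime/build/output/bin/tracr_process"


def tracr_process_cmd(trace_dir, tool=DEFAULT_TOOL):
    """Build the argument list for the post-processing command."""
    # expanduser() resolves "~"; Path drops any trailing slash.
    return [tool, str(Path(trace_dir).expanduser())]


def postprocess(trace_dir="~/ascend/tracr_0", tool=DEFAULT_TOOL):
    """Run tracr_process on trace_dir and return the path to perfetto.json."""
    trace_path = Path(trace_dir).expanduser()
    if not trace_path.is_dir():
        raise FileNotFoundError(f"TraCR trace directory not found: {trace_path}")
    # check=True raises CalledProcessError if the tool exits non-zero.
    subprocess.run(tracr_process_cmd(trace_path, tool), check=True)
    # Assumption: the output file is written inside the trace directory.
    output = trace_path / "perfetto.json"
    if not output.is_file():
        raise FileNotFoundError(f"Expected output not found: {output}")
    return output
```

Calling postprocess() with no arguments targets ~/ascend/tracr_0, matching the command above. A missing directory usually means the run did not use a TraCR-enabled build or the expected runtime.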
Motivation / Use Case
Show the current state of the decode_attention_csa.py (DeepSeek V4 CSA decode-attention) kernel using TraCR.
Validate that TraCR (simpler tracr branch, BUILD_TRACR=ON) can profile a real pypto-lib kernel end-to-end on actual NPU hardware — capturing AICPU/host-runtime traces from a standalone, single-device kernel run — as a first step toward profiling larger models.