Category
Technical Debt (cleanup, refactor)
Component
Platform (a2a3 / a2a3sim)
Description
TensorDump and ScopeStats already live entirely in src/common/ (aicpu + host + header) — one source compiled for both arches. L2Swimlane, PMU, and DepGen are still duplicated per-arch under src/{a2a3,a5}/platform/shared/. This issue covers L2Swimlane and DepGen (PMU is genuinely arch-divergent — separate effort, see below).
Measured on main (4d5fbe4):
| file | a2a3 | a5 | diff lines |
| --- | --- | --- | --- |
| aicpu/dep_gen_collector_aicpu.cpp | 378 | 378 | 0 (byte-identical) |
| aicpu/l2_swimlane_collector_aicpu.cpp | 941 | 941 | 0 (byte-identical) |
| host/dep_gen_collector.cpp | 275 | 296 | 149 |
| host/l2_swimlane_collector.cpp | 1019 | 1060 | 187 |
AICPU side — pure duplication. Both files are byte-identical across arch: the device-side writer only touches its own device view of shared memory, so it is transport-agnostic; all arch differences (struct sizes, PLATFORM_* constants, register addresses) already resolve through arch-specific headers at build time. There is no reason for two copies.
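To illustrate why one AICPU source can serve both arches, here is a minimal sketch. Every name in it (SKETCH_ARCH_A5, SKETCH_PLATFORM_MAX_RECORDS, SketchRecord, SketchRing, sketch_write_record) and both constant values are hypothetical, not taken from the tree; the preprocessor branch stands in for the arch-specific header that the include path would normally select.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an arch-specific header. In a real tree each arch
// would supply its own header and the include path would pick one of them.
#if defined(SKETCH_ARCH_A5)
constexpr std::size_t SKETCH_PLATFORM_MAX_RECORDS = 16;  // assumed value
#else
constexpr std::size_t SKETCH_PLATFORM_MAX_RECORDS = 8;   // assumed value
#endif

// Hypothetical record and ring layout living in shared memory.
struct SketchRecord {
    std::uint64_t task_id;
    std::uint64_t timestamp;
};

struct SketchRing {
    std::uint32_t count = 0;
    SketchRecord records[SKETCH_PLATFORM_MAX_RECORDS] = {};
};

// Arch-agnostic writer: it only touches the device view of the ring and never
// needs to know how the host will later read the data back.
inline bool sketch_write_record(SketchRing* ring, const SketchRecord& rec) {
    assert(ring != nullptr);
    if (ring->count >= SKETCH_PLATFORM_MAX_RECORDS) {
        return false;  // ring full; the caller decides whether to drop or flush
    }
    ring->records[ring->count] = rec;
    ring->count += 1;
    return true;
}
```

The writer body is the same for either value of the constant, which is the property that makes the per-arch copies redundant.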
Host side — diverges only on transport, and the abstraction to remove that already exists. The divergence is entirely SVM vs host-shadow: a2a3 inlines raw alloc_cb + register_cb (halHostRegister) + register_mapping, while a5 uses alloc_paired_buffer + profiling_copy_to/from_device. The bulk (record copy-out, reconcile, export) is the same logic. TensorDump/ScopeStats already hide this behind ProfilerBase::alloc_paired_buffer (which branches internally: halHostRegister / non-SVM malloc-shadow+copy / SVM identity-map) plus profiling_copy_*_or_null() (platform decides whether copy callbacks exist). a2a3 dep_gen host still calls alloc_paired_buffer 0 times; a5 calls it 3 times — the a2a3 side simply was never migrated to the abstraction.
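The shape of that abstraction can be sketched roughly as follows. This is not the real ProfilerBase or profiling_copy API: all Sketch* and sketch_* names and signatures are hypothetical, device memory is simulated with malloc, and only two of the three branches are modeled (malloc-shadow plus copy, and identity-map). The halHostRegister path is left out because it needs the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <functional>

// All names below are hypothetical stand-ins, not the real ProfilerBase API.

// Copy callback; returns 0 on success. Left empty when the platform has no
// separate copy step because the host can read device memory directly.
using SketchCopyFn = std::function<int(void* dst, const void* src, std::size_t n)>;

struct SketchTransport {
    std::function<void*(std::size_t)> alloc_dev;  // device-side allocator
    std::function<void(void*)> free_dev;          // matching release
    SketchCopyFn copy_from_device;                // empty means identity-mapped
};

struct SketchPairedBuffer {
    void* dev = nullptr;       // address the device writer uses
    void* host = nullptr;      // address the host collector reads
    bool owns_shadow = false;  // true when host is a separate malloc'd shadow
};

// Simulated copy callback for exercising the shadow path on a plain host.
inline int sketch_memcpy_copy(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
    return 0;
}

// The transport decision is made once, here; callers never branch on it.
inline SketchPairedBuffer sketch_alloc_paired(const SketchTransport& t, std::size_t n) {
    assert(t.alloc_dev && t.free_dev);
    SketchPairedBuffer b;
    b.dev = t.alloc_dev(n);
    if (b.dev == nullptr) {
        return b;
    }
    if (t.copy_from_device) {
        b.host = std::calloc(1, n);  // zeroed host shadow
        if (b.host == nullptr) {
            t.free_dev(b.dev);
            b.dev = nullptr;
            return b;
        }
        b.owns_shadow = true;
    } else {
        b.host = b.dev;  // identity map: one address serves both sides
    }
    return b;
}

// Makes device writes visible through b.host; a no-op when identity-mapped.
inline bool sketch_sync_to_host(const SketchTransport& t, const SketchPairedBuffer& b,
                                std::size_t n) {
    if (b.dev == nullptr || b.host == nullptr) {
        return false;
    }
    if (!b.owns_shadow) {
        return true;
    }
    return t.copy_from_device(b.host, b.dev, n) == 0;
}

inline void sketch_free_paired(const SketchTransport& t, SketchPairedBuffer& b) {
    if (b.owns_shadow) {
        std::free(b.host);
    }
    if (b.dev != nullptr) {
        t.free_dev(b.dev);
    }
    b = SketchPairedBuffer{};
}
```

With this shape, collector code (record copy-out, reconcile, export) calls sketch_sync_to_host and then reads b.host, regardless of which transport the platform wired in.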
Location
(Symbols only.)
- AICPU (identical → move):
src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector_aicpu,dep_gen_collector_aicpu}.cpp
- Host (transport-diverged → migrate):
src/{a2a3,a5}/platform/shared/host/{l2_swimlane_collector,dep_gen_collector}.cpp
- Reference implementations already in common:
src/common/platform/shared/{aicpu,host}/{tensor_dump_aicpu,tensor_dump_collector,scope_stats_collector_aicpu,scope_stats_collector}.*
- Abstraction to adopt:
src/common/platform/include/host/profiler_base.h (alloc_paired_buffer), src/common/platform/include/host/profiling_copy.h (profiling_copy_*_or_null), src/common/platform/include/host/buffer_pool_manager.h.
Proposed Fix
Two independent, incrementally-landable steps, mirroring how TensorDump/ScopeStats are already structured:
- AICPU: move to common (plain relocation). Move the two byte-identical *_aicpu.cpp to src/common/platform/shared/aicpu/, delete the per-arch copies, and wire the build to compile the common source per-arch via include paths (same mechanism tensor_dump_aicpu.cpp already uses). Mechanical; verify the arch-specific headers still resolve.
- Host: migrate to the transport abstraction, then collapse. Refactor the a2a3 host to use alloc_paired_buffer + profiling_copy_*_or_null() instead of inline alloc_cb/register_cb/register_mapping, so both arches take one code path; then merge the two host .cpp into one common file (leaving only the platform-resolved copy-callback wiring, exactly as TensorDump does).
PMU is out of scope here: its AICPU side genuinely diverges (a5 AICore PMU staging-ring, 10 vs 8 counters, dual CTRL registers) and its host adds counter-count differences — that needs a common-skeleton + arch-hooks approach, not a plain move. Track PMU with the device-side unification effort.
Validation: this changes device memory ordering and transport behavior, so it must be validated onboard on both arches (sim does not exercise the SVM-vs-shadow transport difference).
Priority
Low (no impact today, good to fix eventually)
Related: #1247 (device-side plumbing unification — the collector axis; this issue is the arch axis), #1251 (all-SPSC buffer-pool redesign). Reference precedent: TensorDump & ScopeStats are already fully common.