[Code Health] Hoist L2Swimlane & DepGen collectors into common (like TensorDump/ScopeStats): move identical AICPU, migrate host to alloc_paired_buffer #1253

@ChaoZheng109

Category

Technical Debt (cleanup, refactor)

Component

Platform (a2a3 / a2a3sim)

Description

TensorDump and ScopeStats already live entirely in src/common/ (aicpu + host + header): one source, compiled for both arches. L2Swimlane, PMU, and DepGen are still duplicated per arch under src/{a2a3,a5}/platform/shared/. This issue covers L2Swimlane and DepGen only. PMU is genuinely arch-divergent and needs a separate effort (see Proposed Fix).

Measured on main (4d5fbe4):

| file | a2a3 | a5 | diff lines |
|---|---|---|---|
| aicpu/dep_gen_collector_aicpu.cpp | 378 | 378 | 0 (byte-identical) |
| aicpu/l2_swimlane_collector_aicpu.cpp | 941 | 941 | 0 (byte-identical) |
| host/dep_gen_collector.cpp | 275 | 296 | 149 |
| host/l2_swimlane_collector.cpp | 1019 | 1060 | 187 |

AICPU side: pure duplication. Both files are byte-identical across the two arches. The device-side writer only touches its own device view of shared memory, so it is transport-agnostic, and all arch differences (struct sizes, PLATFORM_* constants, register addresses) already resolve through arch-specific headers at build time. There is no reason to keep two copies.
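To illustrate why a single AICPU source can serve both arches, here is a minimal sketch of a common writer whose only arch dependence is a constant supplied by a header. Everything in it is hypothetical: the SKETCH_ARCH_A5 macro stands in for the include-path selection the real build performs, and the constant values, the SketchRecord layout, and the function names are invented for illustration and are not taken from the repository.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the per-arch header that the include path would select.
// The macro name and both values are hypothetical.
#if defined(SKETCH_ARCH_A5)
constexpr std::size_t SKETCH_PLATFORM_MAX_CORES = 32;
#else
constexpr std::size_t SKETCH_PLATFORM_MAX_CORES = 24;
#endif

// Hypothetical record layout written by the device-side collector.
struct SketchRecord {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
    std::uint32_t core_id;
    std::uint32_t task_id;
};

// Common code: the ring size follows from the arch constant, so the same
// source yields a different layout per arch without any #ifdef in the logic.
constexpr std::size_t sketch_ring_bytes(std::size_t records_per_core) {
    return SKETCH_PLATFORM_MAX_CORES * records_per_core * sizeof(SketchRecord);
}

// Common code: writes one record into the per-core slice of the ring.
// Returns false when the core id or slot falls outside the arch's limits.
inline bool sketch_write(SketchRecord *ring, std::size_t records_per_core,
                         std::size_t slot, const SketchRecord &rec) {
    if (rec.core_id >= SKETCH_PLATFORM_MAX_CORES || slot >= records_per_core) {
        return false;
    }
    ring[rec.core_id * records_per_core + slot] = rec;
    return true;
}
```

The writer never asks which arch it runs on. It only consumes names that the selected header defines, which is the property that makes the two existing copies byte-identical.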

Host side: diverges only on transport, and the abstraction that removes the divergence already exists. The difference is entirely SVM vs host-shadow: a2a3 inlines raw alloc_cb + register_cb (halHostRegister) + register_mapping, while a5 uses alloc_paired_buffer + profiling_copy_to/from_device. The bulk of each file (record copy-out, reconcile, export) is the same logic. TensorDump/ScopeStats already hide the transport behind ProfilerBase::alloc_paired_buffer (which branches internally: halHostRegister / non-SVM malloc-shadow+copy / SVM identity-map) plus profiling_copy_*_or_null() (the platform decides whether copy callbacks exist). The a2a3 dep_gen host still calls alloc_paired_buffer 0 times, while a5 calls it 3 times. The a2a3 side was simply never migrated to the abstraction.
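The three-way branch described above can be sketched roughly as follows. This is not the real ProfilerBase::alloc_paired_buffer, whose signature is not shown in this issue. The Transport enum, PairedBuffer, TransportOps, and both function names are hypothetical, and the callbacks stand in for the platform's alloc and register hooks.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>

// Hypothetical transport kinds, mirroring the three branches named above.
enum class Transport { HostRegister, Shadow, SvmIdentity };

// Hypothetical result: the device pointer plus the pointer the host reads.
struct PairedBuffer {
    void *dev = nullptr;
    void *host = nullptr;
    std::size_t size = 0;
    bool owns_shadow = false;  // true only when host is a separate malloc'd copy
};

// Hypothetical callbacks standing in for the platform hooks.
struct TransportOps {
    std::function<void *(std::size_t)> alloc_dev;
    std::function<void *(void *, std::size_t)> host_register;  // returns host alias
    std::function<void(void *)> free_dev;
};

// Allocates device memory, then picks the host view according to transport.
inline PairedBuffer alloc_paired_buffer_sketch(Transport t, const TransportOps &ops,
                                               std::size_t size) {
    PairedBuffer b;
    b.size = size;
    b.dev = ops.alloc_dev(size);
    if (b.dev == nullptr) {
        return b;  // allocation failed; host stays null
    }
    switch (t) {
    case Transport::HostRegister:
        // Registered mapping: the hook hands back a host-visible alias.
        b.host = ops.host_register(b.dev, size);
        break;
    case Transport::Shadow:
        // Non-SVM: the host reads a zeroed shadow that must be refreshed by copy.
        b.host = std::calloc(1, size);
        b.owns_shadow = (b.host != nullptr);
        break;
    case Transport::SvmIdentity:
        // SVM: host and device share one address, so no copy is needed.
        b.host = b.dev;
        break;
    }
    return b;
}

// Releases the shadow (if any) and the device allocation, then resets the pair.
inline void free_paired_buffer_sketch(const TransportOps &ops, PairedBuffer &b) {
    if (b.owns_shadow) {
        std::free(b.host);
    }
    if (b.dev != nullptr) {
        ops.free_dev(b.dev);
    }
    b = PairedBuffer{};
}
```

A collector written against such a pair never needs to know which branch was taken. It reads from host and, only in the shadow case, refreshes it through a copy callback first.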

Location

(Symbols only.)

  • AICPU (identical → move): src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector_aicpu,dep_gen_collector_aicpu}.cpp
  • Host (transport-diverged → migrate): src/{a2a3,a5}/platform/shared/host/{l2_swimlane_collector,dep_gen_collector}.cpp
  • Reference implementations already in common: src/common/platform/shared/{aicpu,host}/{tensor_dump_aicpu,tensor_dump_collector,scope_stats_collector_aicpu,scope_stats_collector}.*
  • Abstraction to adopt: src/common/platform/include/host/profiler_base.h (alloc_paired_buffer), src/common/platform/include/host/profiling_copy.h (profiling_copy_*_or_null), src/common/platform/include/host/buffer_pool_manager.h.

Proposed Fix

Two independent, incrementally-landable steps, mirroring how TensorDump/ScopeStats are already structured:

  1. AICPU: move to common (plain relocation). Move the two byte-identical *_aicpu.cpp to src/common/platform/shared/aicpu/, delete the per-arch copies, and wire the build to compile the common source per-arch via include paths (same mechanism tensor_dump_aicpu.cpp already uses). Mechanical; verify the arch-specific headers still resolve.
  2. Host: migrate to the transport abstraction, then collapse. Refactor the a2a3 host to use alloc_paired_buffer + profiling_copy_*_or_null() instead of inline alloc_cb/register_cb/register_mapping, so both arches take one code path; then merge the two host .cpp into one common file (leaving only the platform-resolved copy-callback wiring, exactly as TensorDump does).
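As a rough illustration of the single code path that step 2 aims for, the sketch below shows a collector whose copy-out logic is identical on both transports and differs only in whether a copy callback is present. The names BufferView, CopyFromDeviceFn, sync_for_host_read, and collect_records are hypothetical. The real profiling_copy_*_or_null() signatures are not shown in this issue, so a nullable function pointer is assumed here.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical callback type: copies n bytes device -> host, returns 0 on success.
using CopyFromDeviceFn = int (*)(void *host_dst, const void *dev_src, std::size_t n);

// Hypothetical view over a paired buffer.
struct BufferView {
    const void *dev;
    void *host;
    std::size_t size;
};

// Refreshes the host view if the platform supplied a copy callback.
// A null callback means host already aliases device memory (SVM or registered).
inline bool sync_for_host_read(const BufferView &v, CopyFromDeviceFn copy_from_device) {
    if (copy_from_device == nullptr) {
        return true;
    }
    return copy_from_device(v.host, v.dev, v.size) == 0;
}

// Shared copy-out: the same logic regardless of transport.
// Returns an empty vector if the refresh failed.
inline std::vector<unsigned char> collect_records(const BufferView &v,
                                                  CopyFromDeviceFn copy_from_device) {
    std::vector<unsigned char> out;
    if (!sync_for_host_read(v, copy_from_device)) {
        return out;
    }
    const auto *p = static_cast<const unsigned char *>(v.host);
    out.assign(p, p + v.size);
    return out;
}

// Example callback for the shadow case; a plain memcpy stands in for the
// real device-to-host transfer.
inline int memcpy_copy_from_device(void *host_dst, const void *dev_src, std::size_t n) {
    std::memcpy(host_dst, dev_src, n);
    return 0;
}
```

With this shape, the only per-platform piece left is which callback (or null) gets passed in, which matches the "platform-resolved copy-callback wiring" that step 2 leaves behind.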

PMU is out of scope here. Its AICPU side genuinely diverges (a5 AICore PMU staging-ring, 10 vs 8 counters, dual CTRL registers) and its host side adds counter-count differences, so it needs a common-skeleton + arch-hooks approach, not a plain move. Track PMU with the device-side unification effort.

Validation: this change touches device memory ordering and transport, so it must be validated onboard on both arches. Sim does not exercise the SVM-vs-shadow transport difference.

Priority

Low (no impact today, good to fix eventually)

Related: #1247 (device-side plumbing unification — the collector axis; this issue is the arch axis), #1251 (all-SPSC buffer-pool redesign). Reference precedent: TensorDump & ScopeStats are already fully common.
