[Code Health] Hoist L2Swimlane & DepGen collectors into common (like TensorDump/ScopeStats): move identical AICPU, migrate host to alloc_paired_buffer #1253

### Category

Technical Debt (cleanup, refactor)

### Component

Platform (a2a3 / a2a3sim)

### Description

TensorDump and ScopeStats already live entirely in `src/common/` (aicpu + host + header) — one source compiled for both arches. L2Swimlane, PMU, and DepGen are still duplicated per-arch under `src/{a2a3,a5}/platform/shared/`. This issue covers **L2Swimlane and DepGen** (PMU is genuinely arch-divergent — separate effort, see below).

Measured on `main` (4d5fbe4c):

| file | a2a3 | a5 | diff lines |
| :-- | :-- | :-- | :-- |
| `aicpu/dep_gen_collector_aicpu.cpp` | 378 | 378 | **0 (byte-identical)** |
| `aicpu/l2_swimlane_collector_aicpu.cpp` | 941 | 941 | **0 (byte-identical)** |
| `host/dep_gen_collector.cpp` | 275 | 296 | 149 |
| `host/l2_swimlane_collector.cpp` | 1019 | 1060 | 187 |

**AICPU side: pure duplication.** Both files are byte-identical across arches. The device-side writer only touches its own device view of shared memory, so it is transport-agnostic; all arch differences (struct sizes, `PLATFORM_*` constants, register addresses) already resolve through arch-specific headers at build time. There is no reason to keep two copies.
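
To illustrate why one source can serve both arches, here is a minimal sketch of a writer whose only arch dependency is a constant supplied by a header. Everything in it is hypothetical: the `SKETCH_ARCH_A5` macro merely emulates, inside one file, what per-arch include paths do in the real build, and the `Sketch*` names and slot counts are invented for illustration, not taken from the collectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for an arch-specific header picked up via the include path.
// In a real build each arch ships its own header; the values are made up.
#if defined(SKETCH_ARCH_A5)
constexpr std::size_t PLATFORM_SKETCH_RING_SLOTS = 128;
#else
constexpr std::size_t PLATFORM_SKETCH_RING_SLOTS = 64;
#endif

// Common source from here on: no arch conditionals, only the constant.
struct SketchRecord {
    std::uint64_t task_id;
    std::uint64_t cycles;
};

struct SketchRing {
    std::uint32_t head = 0;
    SketchRecord slots[PLATFORM_SKETCH_RING_SLOTS];
};

// Appends one record to the device-side view; returns false when full.
inline bool sketch_write(SketchRing& ring, const SketchRecord& rec) {
    assert(ring.head <= PLATFORM_SKETCH_RING_SLOTS);
    if (ring.head >= PLATFORM_SKETCH_RING_SLOTS) {
        return false;  // ring full, record dropped
    }
    ring.slots[ring.head] = rec;
    ++ring.head;
    return true;
}
```

The writer never asks how the host will read the ring, which is the sense in which the device side is transport-agnostic.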

**Host side: diverges only on transport, and the abstraction that removes the divergence already exists.** The divergence is entirely SVM vs host-shadow: a2a3 inlines raw `alloc_cb` + `register_cb` (`halHostRegister`) + `register_mapping`, while a5 uses `alloc_paired_buffer` + `profiling_copy_to/from_device`. The bulk (record copy-out, reconcile, export) is the same logic. TensorDump/ScopeStats already hide this behind `ProfilerBase::alloc_paired_buffer` (which branches internally: halHostRegister / non-SVM malloc-shadow+copy / SVM identity-map) plus `profiling_copy_*_or_null()` (the platform decides whether copy callbacks exist). The a2a3 dep_gen host calls `alloc_paired_buffer` **0** times, whereas the a5 one calls it 3 times: the a2a3 side was simply never migrated to the abstraction.
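
As a rough illustration of the three-way branch described above, the following sketch pairs a device allocation with a host-side view. It is not the real `ProfilerBase::alloc_paired_buffer`: the `Transport` enum, `TransportOps`, `PairedBuffer`, and the function name are assumptions made up for this example, and release/unregister paths are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>

// Hypothetical transport kinds mirroring the three cases named in the text.
enum class Transport { HostRegister, Shadow, IdentityMap };

struct PairedBuffer {
    void* dev = nullptr;       // pointer the device writes through
    void* host = nullptr;      // pointer the host reads through
    bool owns_shadow = false;  // true when host is a separate malloc'd copy
};

// Hypothetical platform hooks; real code would take them from the platform.
struct TransportOps {
    Transport kind;
    std::function<void*(std::size_t)> alloc_dev;
    std::function<void*(void*, std::size_t)> register_host;  // HostRegister only
};

inline PairedBuffer alloc_paired_buffer_sketch(const TransportOps& ops,
                                               std::size_t bytes) {
    assert(bytes > 0);
    PairedBuffer buf;
    buf.dev = ops.alloc_dev(bytes);
    if (buf.dev == nullptr) {
        return buf;  // allocation failed, both pointers stay null
    }
    switch (ops.kind) {
    case Transport::HostRegister:
        // Device memory is mapped into the host address space.
        buf.host = ops.register_host(buf.dev, bytes);
        break;
    case Transport::Shadow:
        // No shared mapping: keep a host copy that is filled by explicit copies.
        buf.host = std::malloc(bytes);
        buf.owns_shadow = (buf.host != nullptr);
        break;
    case Transport::IdentityMap:
        // SVM: the same pointer is valid on both sides.
        buf.host = buf.dev;
        break;
    }
    return buf;
}
```

A collector written against such a helper only ever sees `dev` and `host`; which branch produced them is invisible to it, which is what lets the per-arch host files converge.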

### Location

(Symbols only.)

- AICPU (identical → move): `src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector_aicpu,dep_gen_collector_aicpu}.cpp`
- Host (transport-diverged → migrate): `src/{a2a3,a5}/platform/shared/host/{l2_swimlane_collector,dep_gen_collector}.cpp`
- Reference implementations already in common: `src/common/platform/shared/{aicpu,host}/{tensor_dump_aicpu,tensor_dump_collector,scope_stats_collector_aicpu,scope_stats_collector}.*`
- Abstraction to adopt: `src/common/platform/include/host/profiler_base.h` (`alloc_paired_buffer`), `src/common/platform/include/host/profiling_copy.h` (`profiling_copy_*_or_null`), `src/common/platform/include/host/buffer_pool_manager.h`.

### Proposed Fix

Two independent, incrementally-landable steps, mirroring how TensorDump/ScopeStats are already structured:

1. **AICPU: move to common (plain relocation).** Move the two byte-identical `*_aicpu.cpp` to `src/common/platform/shared/aicpu/`, delete the per-arch copies, and wire the build to compile the common source per-arch via include paths (same mechanism `tensor_dump_aicpu.cpp` already uses). Mechanical; verify the arch-specific headers still resolve.
2. **Host: migrate to the transport abstraction, then collapse.** Refactor the a2a3 host to use `alloc_paired_buffer` + `profiling_copy_*_or_null()` instead of inline `alloc_cb`/`register_cb`/`register_mapping`, so both arches take one code path; then merge the two host `.cpp` into one common file (leaving only the platform-resolved copy-callback wiring, exactly as TensorDump does).
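
The single host code path targeted by step 2 could look roughly like the sketch below, which drains records through an optional copy callback. The callback type, `SketchRecord`, `sketch_shadow_copy`, and `drain_records` are hypothetical names for illustration; the real `profiling_copy_*_or_null()` signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical callback type: returns 0 on success. A null pointer means the
// host view already aliases device memory, so no copy is needed.
using CopyFromDeviceFn = int (*)(void* host_dst, const void* dev_src,
                                 std::size_t bytes);

struct SketchRecord {
    std::uint64_t task_id;
    std::uint64_t timestamp;
};

// Stand-in for a platform copy callback on a host-shadow transport.
inline int sketch_shadow_copy(void* host_dst, const void* dev_src,
                              std::size_t bytes) {
    std::memcpy(host_dst, dev_src, bytes);
    return 0;
}

// One code path for both transports: refresh the host view if the platform
// supplied a copy callback, then read records from the host view.
inline std::vector<SketchRecord> drain_records(SketchRecord* host_view,
                                               const SketchRecord* dev_view,
                                               std::size_t count,
                                               CopyFromDeviceFn copy_or_null) {
    assert(host_view != nullptr);
    if (copy_or_null != nullptr) {
        const std::size_t bytes = count * sizeof(SketchRecord);
        if (copy_or_null(host_view, dev_view, bytes) != 0) {
            return {};  // copy failed, report nothing rather than stale data
        }
    }
    return std::vector<SketchRecord>(host_view, host_view + count);
}
```

With a shadow transport the caller passes a real callback; with a shared mapping it passes `nullptr` and the same function reads the mapped memory directly.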

PMU is out of scope here: its AICPU side genuinely diverges (a5 AICore PMU staging-ring, 10 vs 8 counters, dual CTRL registers) and its host adds counter-count differences — that needs a common-skeleton + arch-hooks approach, not a plain move. Track PMU with the device-side unification effort.

This is a device memory-ordering / transport change, so it must be validated onboard on both arches (sim does not exercise the SVM-vs-shadow transport difference).

### Priority

Low (no impact today, good to fix eventually)

Related: #1247 (device-side plumbing unification — the collector axis; this issue is the arch axis), #1251 (all-SPSC buffer-pool redesign). Reference precedent: TensorDump & ScopeStats are already fully common.
