[Code Health] Extract the AICPU-side profiling operation layer (enqueue/pop/switch/flush/record) into a templated device engine + per-subsystem trait, symmetric to host ProfilerAlgorithms (#1247)

### Category

Technical Debt (cleanup, refactor)

### Component

Platform (a2a3 / a2a3sim)

### Description

The **host** side of the profiling framework is unified: `ProfilerBase` (poll/drain/collect loops) + `BufferPoolManager` (pools/queues) + `ProfilerAlgorithms<Module>` (the generic algorithm), one implementation for all five subsystems, each supplying a small `Module` trait. The **AICPU (device)** side has no such layer — each of L2Swimlane / PMU / DepGen / TensorDump / ScopeStats reimplements the same device-side logic in its own `*_aicpu.cpp` (per arch for the three arch-specific ones), 400–1000 lines each.

This issue is the **device-side analog of the host unification**: hoist the shared AICPU operation layer into one templated engine + per-subsystem trait. (An earlier framing scoped this to just the two low-level wait helpers; the real duplication is the whole operation layer above them.)

**The operation layer is structurally identical across all five, and each copy has drifted.** Every collector has an `enqueue_*_ready_buffer` (ready-queue push) and a `switch_*_buffer` (buffer rotation), and the switch skeleton is the same everywhere:

```
switch():
  1. null guards (state / current buffer)
  2. check free_queue for an available buffer (head==tail → empty → drop: count dropped, reset, return)
  3. enqueue current full buffer to the ready queue
  4. pop a fresh buffer from free_queue (head+1, buffer_ptrs[head % SLOT_COUNT])
  5. install as current, reset count, wmb
```

The underlying queue layouts are already identical — the ready header exposes `queue_heads[]` / `queue_tails[]` / `queues[][]` and the free-queue exposes `head` / `tail` / `buffer_ptrs[]`, differing only in type name (`PmuDataHeader` / `DepGenDataHeader` / …). Because there is no shared engine, the copies have drifted: the backpressure poll-mask, trailing `wmb()`, top-of-loop `rmb()`, and null-slot handling all differed per copy until #1162 aligned them one file at a time — and **TensorDump's `switch_dump_meta_buffer` still uses an inline `DUMP_SPIN_WAIT_LIMIT` spin instead of the `wait_for_free_queue_entry` helper the others adopted**, a live example of the drift. Every future device-side change has to be made 5+ times and kept in sync by hand.
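
As a rough illustration of how the five-step skeleton could be written once over a trait, here is a minimal single-threaded sketch. Everything in it is hypothetical: the `Demo*Traits` structs, `FreeQueue`, `ReadyQueue`, `CollectorState`, `SwitchResult`, and `sketch_wmb` are illustrative names, not the repository's types. It collapses the `queue_heads[]` / `queue_tails[]` / `queues[][]` header into a single ready queue, omits the backpressure wait, and treats a full ready queue as a drop, which the real collectors may not do.

```cpp
#include <atomic>
#include <cstdint>

// Stand-in for the platform wmb() (assumption: the real barrier is a device intrinsic).
inline void sketch_wmb() { std::atomic_thread_fence(std::memory_order_release); }

// Hypothetical traits: only the constants differ between subsystems.
struct DemoPmuTraits {
    static constexpr uint32_t kSlotCount = 4;
    static constexpr uint32_t kReadyQueueSize = 8;
};
struct DemoDepGenTraits {
    static constexpr uint32_t kSlotCount = 2;
    static constexpr uint32_t kReadyQueueSize = 2;
};

// Simplified layouts: one ready queue instead of per-thread queue arrays.
template <typename T>
struct FreeQueue {
    uint32_t head = 0;  // next slot the device pops
    uint32_t tail = 0;  // next slot the host fills
    uint64_t buffer_ptrs[T::kSlotCount] = {};
};
template <typename T>
struct ReadyQueue {
    uint32_t head = 0;
    uint32_t tail = 0;
    uint64_t entries[T::kReadyQueueSize] = {};
};
template <typename T>
struct CollectorState {
    FreeQueue<T>* free_queue = nullptr;
    ReadyQueue<T>* ready_queue = nullptr;
    uint64_t current_buffer = 0;
    uint32_t count = 0;    // records in the current buffer
    uint32_t dropped = 0;  // records lost to drops
};

enum class SwitchResult { kSwitched, kDropped, kInvalid };

template <typename T>
SwitchResult switch_buffer(CollectorState<T>* s) {
    // 1. Null guards.
    if (s == nullptr || s->free_queue == nullptr || s->ready_queue == nullptr ||
        s->current_buffer == 0) {
        return SwitchResult::kInvalid;
    }
    FreeQueue<T>& fq = *s->free_queue;
    ReadyQueue<T>& rq = *s->ready_queue;
    // 2. No free buffer (or, in this sketch only, no ready slot): drop and reuse the buffer.
    if (fq.head == fq.tail || rq.tail - rq.head == T::kReadyQueueSize) {
        s->dropped += s->count;
        s->count = 0;
        return SwitchResult::kDropped;
    }
    // 3. Enqueue the full buffer, writing the slot before advancing the tail.
    rq.entries[rq.tail % T::kReadyQueueSize] = s->current_buffer;
    sketch_wmb();
    rq.tail += 1;
    // 4. Pop a fresh buffer from the free queue.
    const uint64_t fresh = fq.buffer_ptrs[fq.head % T::kSlotCount];
    fq.head += 1;
    // 5. Install it, reset the count, and publish.
    s->current_buffer = fresh;
    s->count = 0;
    sketch_wmb();
    return SwitchResult::kSwitched;
}
```

Instantiating `switch_buffer<DemoPmuTraits>` and `switch_buffer<DemoDepGenTraits>` yields the same control flow over differently sized queues, which is the property the per-subsystem copies currently maintain by hand.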

### Proposed structure

| Layer | Contents |
| :-- | :-- |
| **Device collector engine** (templated on a `Module` trait) | `enqueue_ready(buffer_ptr, seq)` (ready-queue push), `pop_free()` (free-queue pop-and-install), **`switch_buffer()`** (the enqueue-full + pop-fresh + install skeleton above), **`flush()`** (flush the partial current buffer at teardown), `record(...)` (append + switch-if-full hot path), and the init/finalize skeleton (set header/pool pointers, reset state) |
| **Per-subsystem trait** | header / free-queue / buffer types; buffer-kind count; `*_READYQUEUE_SIZE` / `*_SLOT_COUNT` / backpressure-cycle constants; the record-field-store hook; the drop-accounting hook; the instance shape (single-instance vs per-thread) |

Each `*_aicpu.cpp` then collapses to a trait plus a few subsystem-specific hooks, instead of a full re-implementation of enqueue/pop/switch/flush/record. This also erases the current per-copy drift (the TensorDump spin, the divergent logging, the differing drop-accounting) by construction.
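
To make the engine/trait split concrete, the following is a minimal host-runnable sketch of what a collapsed collector could look like. It is an assumption-laden illustration, not the proposed API: `DemoPmuModule`, `DeviceCollector`, `push_free`, `ready_at`, and the `{buffer, count}` ready entry are all invented here, the queues live inside the object instead of shared memory, barriers and backpressure are omitted, and drop accounting is a plain counter where the real design would call a per-subsystem hook.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical trait for one subsystem (illustrative names, not the real PMU types).
struct DemoPmuModule {
    struct Record {
        uint64_t task_id;
        uint64_t cycles;
    };
    static constexpr std::size_t kRecordsPerBuffer = 2;
    static constexpr std::size_t kSlotCount = 2;       // free-queue capacity
    static constexpr std::size_t kReadyQueueSize = 4;  // ready-queue capacity

    // Record-field-store hook: the only per-subsystem code on the hot path.
    static void store(Record& dst, uint64_t task_id, uint64_t cycles) {
        dst.task_id = task_id;
        dst.cycles = cycles;
    }
};

template <typename M>
class DeviceCollector {
public:
    using Buffer = std::array<typename M::Record, M::kRecordsPerBuffer>;
    struct ReadyEntry {
        Buffer* buffer;
        std::size_t count;
    };

    // Stand-in for the host refilling the free queue.
    bool push_free(Buffer* b) {
        if (free_tail_ - free_head_ == M::kSlotCount) return false;
        free_[free_tail_++ % M::kSlotCount] = b;
        return true;
    }

    // Init skeleton: reset state and install the first buffer.
    bool init() {
        count_ = 0;
        dropped_ = 0;
        return pop_free();
    }

    // Hot path: append one record, then rotate if the buffer just filled.
    template <typename... Args>
    bool record(Args... args) {
        if (current_ == nullptr) return false;
        M::store((*current_)[count_++], args...);
        if (count_ == M::kRecordsPerBuffer) switch_buffer();
        return true;
    }

    // Teardown: publish the partially filled current buffer, if any.
    void flush() {
        if (current_ != nullptr && count_ > 0 && enqueue_ready()) {
            current_ = nullptr;
            count_ = 0;
        }
    }

    std::size_t ready_size() const { return ready_tail_ - ready_head_; }
    const ReadyEntry& ready_at(std::size_t i) const {
        return ready_[(ready_head_ + i) % M::kReadyQueueSize];
    }
    std::size_t dropped() const { return dropped_; }

private:
    bool enqueue_ready() {
        if (ready_tail_ - ready_head_ == M::kReadyQueueSize) return false;
        ready_[ready_tail_++ % M::kReadyQueueSize] = ReadyEntry{current_, count_};
        return true;
    }
    bool pop_free() {
        if (free_head_ == free_tail_) return false;
        current_ = free_[free_head_++ % M::kSlotCount];
        return true;
    }
    void switch_buffer() {
        // No fresh buffer or no ready slot: drop the batch and reuse the current buffer.
        if (free_head_ == free_tail_ || !enqueue_ready()) {
            dropped_ += count_;
            count_ = 0;
            return;
        }
        pop_free();
        count_ = 0;
    }

    Buffer* free_[M::kSlotCount] = {};
    ReadyEntry ready_[M::kReadyQueueSize] = {};
    std::size_t free_head_ = 0, free_tail_ = 0, ready_head_ = 0, ready_tail_ = 0;
    Buffer* current_ = nullptr;
    std::size_t count_ = 0;
    std::size_t dropped_ = 0;
};
```

Under these assumptions a subsystem file would shrink to its trait plus a thin wrapper such as `DeviceCollector<DemoPmuModule> pmu; pmu.record(task_id, cycles);`, with the rotation, drop, and flush logic living in one place.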

### Caveats

- **L2Swimlane is the outlier.** It has 4 buffer kinds, per-core task pools + per-thread phase pools + AICore rotation, plus `flush_phase_pool` / `switch_phase_buffer_kind`. The single-instance subsystems (DepGen, ScopeStats) and the simple per-thread ones (PMU, TensorDump) collapse cleanly into the engine; L2Swimlane's multi-pool structure has to either compose several engine instances or keep some subsystem-specific code on top. Don't force it into the base if it distorts the common case.
- **Device memory-ordering code.** Every `rmb()` / `wmb()` must be preserved exactly; sim does not exercise weak-memory reordering, so this must be **onboard-validated on both arches**.
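
For readers less familiar with why barrier placement is load-bearing, here is a generic single-producer/single-consumer queue sketch in portable C++. It is not the repository's code: `SpscReadyQueue` is a hypothetical name, and the release stores and acquire loads are only an approximation of where a device `wmb()` (before publishing an index) and `rmb()` (after reading an index) would sit. The actual barrier positions in each `*_aicpu.cpp` are what must be carried over unchanged.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

template <std::size_t N>
struct SpscReadyQueue {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two so index wrap stays consistent");

    uint64_t slots[N] = {};
    std::atomic<uint32_t> head{0};  // advanced only by the consumer
    std::atomic<uint32_t> tail{0};  // advanced only by the producer

    // Producer: write the slot first, then publish the new tail.
    bool push(uint64_t buffer_ptr) {
        const uint32_t t = tail.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release on head, so the slot is
        // known to be fully consumed before it is overwritten.
        const uint32_t h = head.load(std::memory_order_acquire);
        if (t - h == N) return false;  // full
        slots[t % N] = buffer_ptr;
        // Release plays the role of a wmb() before the index update: a reader
        // that observes the new tail also observes the slot contents.
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    // Consumer: read the tail first, then the slot it guards.
    bool pop(uint64_t* out) {
        const uint32_t h = head.load(std::memory_order_relaxed);
        // Acquire plays the role of an rmb() after reading the index.
        const uint32_t t = tail.load(std::memory_order_acquire);
        if (h == t) return false;  // empty
        *out = slots[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

If the release on `tail` were moved above the slot write, a consumer on weakly ordered hardware could see the new tail and read a stale slot. A simulator that executes in program order would not expose that, which is consistent with the requirement above to validate onboard.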

### Location

(Symbols only — files are a moving target.)

- Per-arch: `src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector,pmu_collector,dep_gen_collector}_aicpu.cpp` — `enqueue_*_ready_buffer`, `switch_*_buffer` / `switch_records_buffer` / `switch_phase_buffer_kind`, `flush_phase_pool`, `try_pop_*`.
- Common: `src/common/platform/shared/aicpu/{tensor_dump_aicpu,scope_stats_collector_aicpu}.cpp` — `enqueue_*`, `switch_dump_meta_buffer` (note the inline `DUMP_SPIN_WAIT_LIMIT`), `switch_buffer`.
- Candidate home for the shared engine header: alongside `src/common/platform/include/aicpu/`.
- Host precedent to mirror: `src/common/platform/include/host/profiler_base.h` (`ProfilerAlgorithms<Module>` + the `Module` trait contract).

### Proposed Fix

Introduce one templated device-side collector engine (single source of truth) implementing enqueue/pop/switch/flush/record/init-finalize over a `Module` trait, included by all five AICPU collectors in place of the per-subsystem copies: the device analog of `ProfilerAlgorithms<Module>`. The change should be mechanical, validated onboard, and preserve every barrier. Land it after #1162 (which aligned the helper shapes and is the natural precursor).

### Priority

Low (no impact today, good to fix eventually)

### Related

- #1253: the **arch axis** (a2a3↔a5) dedup. The L2Swimlane and DepGen AICPU files are byte-identical across the two arches and can be plain-moved to common. **This issue is the collector axis** (across subsystems). The two overlap: either land #1253's quick move first, or accept that this engine extraction largely **subsumes** #1253 (once collectors become trait + hooks, the byte-identical per-arch files disappear anyway).
- #1237: collector-output sharding on the host side (shares the static-ownership linchpin).
- #1251: all-SPSC buffer-pool redesign (host-side pipeline); this issue is the device-side operation layer that pushes/pops those same queues.
- #1162: aligned the wait-helper shapes; the precursor.
