Category
Technical Debt (cleanup, refactor)
Component
Platform (a2a3 / a2a3sim)
Description
The host side of the profiling framework is unified: ProfilerBase (poll/drain/collect loops) + BufferPoolManager (pools/queues) + ProfilerAlgorithms<Module> (the generic algorithm), one implementation for all five subsystems, each supplying a small Module trait. The AICPU (device) side has no such layer — each of L2Swimlane / PMU / DepGen / TensorDump / ScopeStats reimplements the same device-side logic in its own *_aicpu.cpp (per arch for the three arch-specific ones), 400–1000 lines each.
This issue is the device-side analog of the host unification: hoist the shared AICPU operation layer into one templated engine + per-subsystem trait. (An earlier framing scoped this to just the two low-level wait helpers; the real duplication is the whole operation layer above them.)
The operation layer is structurally identical across all five, and each copy has drifted. Every collector has an enqueue_*_ready_buffer (ready-queue push) and a switch_*_buffer (buffer rotation), and the switch skeleton is the same everywhere:
switch():
1. null guards (state / current buffer)
2. check free_queue for an available buffer (head==tail → none free: count dropped, reset, return)
3. enqueue current full buffer to the ready queue
4. pop a fresh buffer from free_queue (head+1, buffer_ptrs[head % SLOT_COUNT])
5. install as current, reset count, wmb
The underlying queue layouts are already identical — the ready header exposes queue_heads[] / queue_tails[] / queues[][] and the free-queue exposes head / tail / buffer_ptrs[], differing only in type name (PmuDataHeader / DepGenDataHeader / …). Because there is no shared engine, the copies have drifted: the backpressure poll-mask, trailing wmb(), top-of-loop rmb(), and null-slot handling all differed per copy until #1162 aligned them one file at a time — and TensorDump's switch_dump_meta_buffer still uses an inline DUMP_SPIN_WAIT_LIMIT spin instead of the wait_for_free_queue_entry helper the others adopted, a live example of the drift. Every future device-side change has to be made 5+ times and kept in sync by hand.
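To make the skeleton above concrete, here is a minimal host-compilable sketch of one collector's switch path. Only the field names (head, tail, buffer_ptrs, queue_heads, queue_tails, queues) come from the description; the struct names, sizes, the ReadyEntry shape, the use of raw Buffer* pointers, the per-record drop count, and the fence-based wmb() stand-in are all assumptions for illustration. The backpressure wait is reduced to an immediate failure, so this is not a drop-in for any existing copy.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sizes and layouts; only the field names come from the issue text.
constexpr uint32_t SLOT_COUNT = 4;       // free-queue capacity (assumed)
constexpr uint32_t READYQUEUE_SIZE = 4;  // ready-queue capacity (assumed)
constexpr uint32_t NUM_QUEUES = 1;       // number of ready queues (assumed)

struct Buffer {
    uint32_t count;  // records currently stored
};

struct FreeQueue {
    uint32_t head;  // device pops here
    uint32_t tail;  // host pushes here
    Buffer* buffer_ptrs[SLOT_COUNT];
};

struct ReadyEntry {
    Buffer* buffer_ptr;
    uint32_t seq;
};

struct DataHeader {
    uint32_t queue_heads[NUM_QUEUES];  // host pops here
    uint32_t queue_tails[NUM_QUEUES];  // device pushes here
    ReadyEntry queues[NUM_QUEUES][READYQUEUE_SIZE];
};

struct CollectorState {
    DataHeader* header;
    FreeQueue* free_queue;
    Buffer* current;
    uint32_t seq;
    uint32_t dropped;  // dropped records (assumed accounting unit)
};

// Host stand-in for the device write barrier; the real wmb() is platform code.
inline void wmb() { std::atomic_thread_fence(std::memory_order_release); }

// Ready-queue push. Returns false when the queue is full; the real helpers
// poll with backpressure instead of failing immediately.
inline bool enqueue_ready(DataHeader* h, uint32_t q, Buffer* buf, uint32_t seq) {
    uint32_t tail = h->queue_tails[q];
    if (tail - h->queue_heads[q] >= READYQUEUE_SIZE) return false;
    h->queues[q][tail % READYQUEUE_SIZE] = ReadyEntry{buf, seq};
    wmb();  // make the entry visible before publishing the new tail
    h->queue_tails[q] = tail + 1;
    return true;
}

inline bool switch_buffer(CollectorState* s, uint32_t q) {
    // 1. Null guards.
    if (s == nullptr || s->header == nullptr || s->free_queue == nullptr ||
        s->current == nullptr) {
        return false;
    }
    FreeQueue* fq = s->free_queue;
    // 2. Is a fresh buffer available at all?
    bool have_free = fq->head != fq->tail;
    // 3. Enqueue the full buffer; on either failure, drop and reuse it.
    if (!have_free || !enqueue_ready(s->header, q, s->current, s->seq)) {
        s->dropped += s->current->count;
        s->current->count = 0;
        return false;
    }
    s->seq++;
    // 4. Pop a fresh buffer from the free queue.
    Buffer* next = fq->buffer_ptrs[fq->head % SLOT_COUNT];
    fq->head = fq->head + 1;
    // 5. Install as current, reset count, barrier.
    s->current = next;
    if (next != nullptr) next->count = 0;
    wmb();
    return true;
}
```

In this sketch a null slot simply leaves current unset, so later calls bail out at the null guard; the real copies' null-slot handling may differ.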
Proposed structure
| Layer | Contents |
| --- | --- |
| Device collector engine (templated on a Module trait) | enqueue_ready(buffer_ptr, seq) (ready-queue push), pop_free() (free-queue pop-and-install), switch_buffer() (the enqueue-full + pop-fresh + install skeleton above), flush() (flush the partial current buffer at teardown), record(...) (append + switch-if-full hot path), and the init/finalize skeleton (set header/pool pointers, reset state) |
| Per-subsystem trait | header / free-queue / buffer types; buffer-kind count; *_READYQUEUE_SIZE / *_SLOT_COUNT / backpressure-cycle constants; the record-field-store hook; the drop-accounting hook; the instance shape (single-instance vs per-thread) |
Each *_aicpu.cpp then collapses to a trait plus a few subsystem-specific hooks, instead of a full re-implementation of enqueue/pop/switch/flush/commit. This also erases the current per-copy drift (the TensorDump spin, the divergent logging, the differing drop-accounting) by construction.
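As a rough illustration of the proposed split, the sketch below pairs a hypothetical CollectorEngine<Module> with a hypothetical DemoModule trait. None of these names, constants, or hook signatures are the project's API; they are assumptions chosen to mirror the layers described above. The wmb() here is a host-side fence standing in for the device barrier, and backpressure is reduced to an immediate drop.

```cpp
#include <atomic>
#include <cstdint>

// Host stand-in for the device write barrier (the real wmb() is platform code).
inline void wmb() { std::atomic_thread_fence(std::memory_order_release); }

// Hypothetical engine: every name here is illustrative.
template <typename Module>
class CollectorEngine {
public:
    using Header = typename Module::Header;
    using FreeQueue = typename Module::FreeQueue;
    using Buffer = typename Module::Buffer;
    using Record = typename Module::Record;

    // Init skeleton: set header/pool pointers, reset state, install a buffer.
    void init(Header* header, FreeQueue* free_queue, uint32_t queue_index) {
        header_ = header;
        free_queue_ = free_queue;
        queue_ = queue_index;
        seq_ = 0;
        current_ = (header_ != nullptr && free_queue_ != nullptr) ? pop_free() : nullptr;
    }

    // Hot path: append, then switch if the buffer is now full.
    void record(const Record& r) {
        if (current_ == nullptr) return;
        Module::store(current_->records[current_->count], r);
        current_->count++;
        if (current_->count == Module::RECORDS_PER_BUFFER) switch_buffer();
    }

    // Enqueue-full + pop-fresh + install; drops when nothing is free.
    bool switch_buffer() {
        if (current_ == nullptr) return false;
        bool have_free = free_queue_->head != free_queue_->tail;
        if (!have_free || !enqueue_ready(current_, seq_)) {
            Module::on_drop(*header_, current_->count);
            current_->count = 0;
            return false;
        }
        seq_++;
        current_ = pop_free();
        wmb();
        return true;
    }

    // Teardown: hand the partial current buffer to the host.
    void flush() {
        if (current_ == nullptr) return;
        if (current_->count > 0 && !enqueue_ready(current_, seq_)) {
            Module::on_drop(*header_, current_->count);
        }
        current_ = nullptr;
    }

private:
    bool enqueue_ready(Buffer* buf, uint32_t seq) {
        uint32_t tail = header_->queue_tails[queue_];
        if (tail - header_->queue_heads[queue_] >= Module::READYQUEUE_SIZE) return false;
        header_->queues[queue_][tail % Module::READYQUEUE_SIZE] =
            typename Module::ReadyEntry{buf, seq};
        wmb();  // entry must be visible before the new tail
        header_->queue_tails[queue_] = tail + 1;
        return true;
    }

    Buffer* pop_free() {
        if (free_queue_->head == free_queue_->tail) return nullptr;
        Buffer* buf = free_queue_->buffer_ptrs[free_queue_->head % Module::SLOT_COUNT];
        free_queue_->head = free_queue_->head + 1;
        if (buf != nullptr) buf->count = 0;
        return buf;
    }

    Header* header_ = nullptr;
    FreeQueue* free_queue_ = nullptr;
    Buffer* current_ = nullptr;
    uint32_t queue_ = 0;
    uint32_t seq_ = 0;
};

// Hypothetical trait for a single-instance subsystem.
struct DemoModule {
    static constexpr uint32_t SLOT_COUNT = 4;
    static constexpr uint32_t READYQUEUE_SIZE = 4;
    static constexpr uint32_t RECORDS_PER_BUFFER = 2;
    struct Record { uint64_t value; };
    struct Buffer { uint32_t count; Record records[RECORDS_PER_BUFFER]; };
    struct FreeQueue { uint32_t head; uint32_t tail; Buffer* buffer_ptrs[SLOT_COUNT]; };
    struct ReadyEntry { Buffer* buffer_ptr; uint32_t seq; };
    struct Header {
        uint32_t queue_heads[1];
        uint32_t queue_tails[1];
        ReadyEntry queues[1][READYQUEUE_SIZE];
        uint32_t dropped;
    };
    static void store(Record& slot, const Record& r) { slot = r; }  // record-field-store hook
    static void on_drop(Header& h, uint32_t n) { h.dropped += n; }  // drop-accounting hook
};
```

Under this shape a per-subsystem file would shrink to its trait plus something like `CollectorEngine<DemoModule> engine; engine.init(&header, &free_queue, 0); engine.record({42}); engine.flush();`. Whether hooks are static functions, as assumed here, or something else is a design choice the real engine would need to settle.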
Caveats
- L2Swimlane is the outlier. It has 4 buffer kinds, per-core task pools + per-thread phase pools + AICore rotation, plus flush_phase_pool / switch_phase_buffer_kind. The single-instance subsystems (DepGen, ScopeStats) and simple per-thread ones (PMU, TensorDump) collapse cleanly into the engine; L2's multi-pool structure has to compose several engine instances or keep some subsystem-specific code on top. Don't force it into the base if it distorts the common case.
- Device memory-ordering code. Every rmb() / wmb() must be preserved exactly; sim does not exercise weak-memory reordering, so this must be onboard-validated on both arches.
Location
(Symbols only — files are a moving target.)
- Per-arch: src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector,pmu_collector,dep_gen_collector}_aicpu.cpp. Symbols: enqueue_*_ready_buffer, switch_*_buffer / switch_records_buffer / switch_phase_buffer_kind, flush_phase_pool, try_pop_*.
- Common: src/common/platform/shared/aicpu/{tensor_dump_aicpu,scope_stats_collector_aicpu}.cpp. Symbols: enqueue_*, switch_dump_meta_buffer (note the inline DUMP_SPIN_WAIT_LIMIT), switch_buffer.
- Candidate home for the shared engine header: alongside src/common/platform/include/aicpu/.
- Host precedent to mirror: src/common/platform/include/host/profiler_base.h (ProfilerAlgorithms<Module> + the Module trait contract).
Proposed Fix
Introduce one templated device-side collector engine (single source of truth) implementing enqueue/pop/switch/flush/record/init-finalize over a Module trait, included by all five AICPU collectors in place of the per-subsystem copies — the device analog of ProfilerAlgorithms<Module>. Mechanical, onboard-validated, preserving every barrier. Land after #1162 (which aligned the helper shapes and is the natural precursor).
Priority
Low (no impact today, good to fix eventually)
Related: